Abstract

This chapter examines the issues related to physical test validation for job selection. The
chapter is divided into three major sections. The first examines issues and accepted methods of
test validation. The focus is on the interpretation of the Equal Employment Opportunity Commission
(EEOC) guidelines (EEOC, 1978) as they relate to test validation. The sanctioned validation
methods are content validity, criterion-related validity, and construct validity. The measurement
theory used to evaluate the quality of employment tests is based on the American Psychological
Association standards for validating educational and psychological tests (A.P.A., 1985; A.P.A.,
1987). A major difference in physical test validation is the use of physiological rather than
psychological tests. The second section of the chapter examines the differences between
physiological and psychological test validation. The goal of physiological validation is to define
the physiological capacity needed by a worker to perform the work demanded by the task. Principal
features of the physiological validation approach are the use of a physiological metric to quantify
test performance and the interpretation of validity results with relevant physiological research
and theory. The final section of the chapter reviews published employment validation research on
physical tests.


Employment Selection Tests


The principal guidance for the design and implementation of selection tests for employment is
the Uniform Guidelines on Employee Selection Procedures issued by the Equal Employment
Opportunity Commission in 1978 (EEOC, 1978). These guidelines state that a selection procedure
has “adverse impact” if the selection rate for any group is less than 80 percent of the rate for
the group with the highest selection rate. Selection procedures that have adverse impact are
considered discriminatory unless they can be justified. A selection procedure that has adverse
impact can be justified if:

1. The tests or measures are derived from a job analysis.

2. The tests or measures are indicators of critical or important job duties, work behaviors, or
work outcomes.

3. The tests or measures have been shown to be valid indicators of such duties, behaviors, or outcomes.
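
The 80 percent rule described before the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a procedure taken from the Guidelines: the function names and the applicant counts are hypothetical.

```python
def selection_rates(applicants, selected):
    """Selection rate (selected / applicants) for each group."""
    return {group: selected[group] / applicants[group] for group in applicants}


def adverse_impact_ratios(applicants, selected):
    """Ratio of each group's selection rate to the highest group's rate."""
    rates = selection_rates(applicants, selected)
    highest = max(rates.values())
    return {group: rate / highest for group, rate in rates.items()}


def has_adverse_impact(applicants, selected, threshold=0.80):
    """True if any group's ratio falls below the 80 percent threshold."""
    ratios = adverse_impact_ratios(applicants, selected)
    return any(ratio < threshold for ratio in ratios.values())


# Hypothetical applicant pool, for illustration only.
applicants = {"men": 100, "women": 50}
selected = {"men": 60, "women": 15}

ratios = adverse_impact_ratios(applicants, selected)
print(ratios, has_adverse_impact(applicants, selected))
```

In this made-up pool the women's selection rate (0.30) is half the men's rate (0.60), so the ratio falls below 0.80 and the procedure would need to be justified.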

The existence of a procedure for justifying selection tests is critical in the area of selection
based on physical abilities. There are well-recognized differences in physical abilities between
genders (McArdle, Katch, & Katch, 1996), and the development of a physical abilities selection test
for physically demanding jobs runs a great risk of having adverse impact across gender.

The nature of job analyses, identification of critical or important job duties, and nature of
physical selection tests are discussed in other sections. This section considers issues surrounding
the demonstration of the validity of selection tests or measures. As in other sections of this
report, the emphasis is on selection based on physical ability.

Validity  of  Selection  Tests 

The  extent  to  which  a  test  or  set  of  tests  measures  what  it  is  meant  to  measure  is  called  the 
validity  of  the  test.  For  the  purposes  of  this  chapter,  validity  is  the  accuracy  with  which  selection 
test(s)  measure  important  work  behaviors  (Jackson,  1994).  The  Uniform  Guidelines  recognize 
three  types  of  validity  with  respect  to  selection  test  development:  content  validity,  criterion-related 
validity,  and  construct  validity. 

Content Validity: A test has content validity when its items reflect important elements of the
job. The job and test content are linked. Most content-valid test items are, in fact, job samples
or simulations of job tasks. Theoretically, for the test as a whole to be content valid, the test
items must sample all critical or important duties, work behaviors, or work outcomes. For example,
if a job has two critical, physically demanding tasks, one involving repeated lifting to a fixed
height and one involving carrying materials a long distance, both tasks need to be simulated in the
content-valid selection test. Such job sample tasks are usually scored as to whether the applicant
can or cannot perform the task. Additionally, for jobs that have time constraints, such as
emergency service tasks, there may be time limits imposed for task completion. Successful
completion of the tasks qualifies one for the job. Content-valid tests are the most defensible
tests because they are the most direct indicators of job performance capability. The closer the
simulation is to the actual job task, the more defensible it is as a selection test.

Criterion-Related Validity: A test is said to have criterion-related validity when the test items
are shown to be estimators or predictors of critical or important duties, work behaviors, or work
outcomes. Criterion-related validity is usually expressed as a correlation coefficient between test
performance (the predictor) and performance of an important or critical job element or behavior
(the criterion). The criterion job element can be any of a number of job behaviors including
work-task performance, injury rates on the job, absenteeism, or peer or supervisor ratings.
Criterion-related selection tests are not, by definition, direct indicators of the ability to
perform a job or job task. They rely on a secondary relationship between the criterion task and the
predictor test.

Two types of criterion-related validity can be distinguished. A test is said to have concurrent
validity whenever the test is used to predict a current capability. An example is the use of a
bench press 1-repetition maximum (1RM) to predict an applicant's current ability to lift a 50-kg
box to elbow height. If the test is used to predict some future event, it is said to have
predictive validity. An example would be the use of the time to complete a 1-mile run as an
indicator of future success in a military training program.

Correlational studies are carried out to demonstrate criterion-related validity. Critical or
important job behaviors are determined during the job analysis. The nature of the critical job
behaviors usually suggests the nature of the selection test to be employed. If a critical task
requires lifting, for example, then selection tests that measure strength would be appropriate. If
the critical job task requires prolonged activity, then a test related to endurance, such as a run
for time, might be appropriate. Once candidate tests have been chosen, the tests are administered
to a sample of workers or another suitable sample. Their performance on the identified critical job
tasks (or other criterion measures) is also measured. The strength of the association between
performance on the selection tests and performance on the critical job behaviors is expressed as
the correlation coefficient, the square of which gives the proportion of variance that the two
measures share. If the correlation coefficient between performance on a selection test and
performance on a critical job behavior is suitably high, the selection test may be used. It should
be noted that there is no standard for the minimum acceptable correlation coefficient between a
selection test and job behavior. Statistical significance is not always a good indicator because,
with large sample sizes, a correlation that explains only a small part of the variance can be
significant. What is possible or practical may drive the selection of an acceptable level of
correlation. As a benchmark, one might note that a correlation coefficient of 0.707 indicates that
the two measures share 50 percent of their variance, but this is difficult to use as a criterion
because many things can affect the size of a correlation coefficient. For example, the size of a
correlation is influenced substantially by the variability of the sample tested. It is also
possible to have a high correlation but considerable errors in prediction (Altman & Bland, 1983;
Altman & Bland, 1986). This subject is covered in more detail in another section of this chapter.
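
The relationship between a correlation coefficient, shared variance, and prediction error can be made concrete with a short Python sketch. The scores below are hypothetical and the function names are illustrative only; the standard error of estimate is used here as one common way to express prediction error in the units of the criterion, which is an assumption of this sketch rather than a method named in the chapter.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def shared_variance(r):
    """Proportion of variance the two measures share (r squared)."""
    return r ** 2


def standard_error_of_estimate(y, r):
    """Typical size of a prediction error, in the units of the criterion y."""
    n = len(y)
    mean_y = sum(y) / n
    syy = sum((b - mean_y) ** 2 for b in y)
    # max() guards against a tiny negative value caused by rounding.
    return math.sqrt(syy * max(0.0, 1.0 - r ** 2) / (n - 2))


# Hypothetical scores in kg, for illustration only.
strength_test = [18.0, 20.0, 23.0, 25.0, 27.0, 30.0, 33.0, 36.0]
box_lift = [40.0, 47.0, 45.0, 55.0, 52.0, 63.0, 60.0, 72.0]

r = pearson_r(strength_test, box_lift)
print(f"r = {r:.3f}, shared variance = {shared_variance(r):.1%}")
print(f"standard error of estimate = {standard_error_of_estimate(box_lift, r):.1f} kg")
```

Reporting the error in kilograms alongside the correlation shows how far an individual prediction can miss even when the correlation itself looks strong.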

The scoring of criterion-related tests is based on the achievement of critical performance levels
on the selection test(s). These critical performance levels can be quite difficult to define.
Usually, they are derived from a mathematical function relating the predictor and criterion
performances. The value of the performance on the selection test that is associated mathematically
with a critical level of performance on the important job task is used as the cut-off score, or
cut-score, on the selection test. This critical level of job performance needs to be identified in
the job analysis. This subject is covered in more detail in another section of this chapter.
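
One way to read "associated mathematically" is to fit a straight line relating criterion performance to test performance and then solve that line for the test score at the critical job level. The Python sketch below does this with ordinary least squares. The choice of a linear fit, the function names, and the data are all assumptions made for illustration; they are not taken from the studies discussed in this chapter.

```python
def fit_line(predictor, criterion):
    """Least-squares line: criterion = intercept + slope * predictor."""
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(criterion) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, criterion))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def cut_score(slope, intercept, critical_level):
    """Predictor score whose predicted criterion equals the critical level."""
    return (critical_level - intercept) / slope


# Hypothetical scores in kg, for illustration only.
arm_curl = [15.0, 20.0, 25.0, 30.0, 35.0]
box_lift = [36.0, 44.0, 53.0, 62.0, 70.0]

slope, intercept = fit_line(arm_curl, box_lift)
cut = cut_score(slope, intercept, critical_level=50.0)
print(f"cut-score on the arm curl: {cut:.1f} kg")
```

The critical level (50 kg here) has to come from the job analysis; the regression only translates it into the metric of the selection test.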

Even in the simplest case, when a single critical task, a single critical level of performance,
and a single predictor measure are identified, it can be difficult to set a critical level of
performance on the selection test. This is because the relationship between performance on the
selection test and performance on the criterion task is not perfect. As an example, Figure 5.1
shows the relationship between the maximum weight box that can be lifted to elbow height (the work
task) and 1RM for arm-curl (the predictor test). As one can see, arm-curl 1RM and maximum box weight
appear to be strongly related. The correlation coefficient for this relationship is 0.875.
Furthermore, the relationship between the variables appears to be a straight line, as suggested by
the diagonal line crossing the figure. This line represents the linear regression of maximal box
lift weight with arm-curl 1RM. However, the points are scattered about the line. If the critical
task for a particular job involved lifting a 50-kg box to elbow height (the value indicated by the
horizontal line), the mean arm-curl value associated with this box weight is 23.4 kg (the solid
vertical line). This, ideally, would be the critical arm-curl 1RM value that we would pick if
arm-curl 1RM were the selection task for this job. However, it is clear by inspection of Figure 5.1
that some individuals who lifted less than 23.4 kg on the arm-curl could lift a 50-kg box. These
individuals are called “false negatives” because they failed the test (and are not selected) but
can perform the work task. In Figure 5.1, the false negatives appear in the upper left quadrant
formed by the horizontal and vertical lines within the figure. Similarly, some individuals who
lifted more than 23.4 kg could not lift a 50-kg box. These individuals are known as “false
positives” because they passed the test (and were selected) but cannot perform the work task. In
Figure 5.1, these individuals appear in the lower right quadrant. The Uniform Guidelines allow the
exercise of a certain amount of judgment in setting cut-scores. However, one needs to have a
defensible rationale. These issues are examined in more detail in the physiological validation
section of this chapter.

Figure 5.1 Maximum box weight lifted to elbow height as a function of arm-curl 1RM
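
The four quadrants described for Figure 5.1 amount to a simple classification of each applicant. The Python sketch below uses the 23.4-kg cut-score and the 50-kg critical box weight from the text; the individual data points and the function names are hypothetical, since the data behind Figure 5.1 are not reproduced here.

```python
from collections import Counter


def classify(test_score, task_score, cut_score, critical_level):
    """Label one applicant by test outcome versus actual task capability."""
    passed = test_score >= cut_score
    capable = task_score >= critical_level
    if passed:
        return "true positive" if capable else "false positive"
    return "false negative" if capable else "true negative"


def tally(applicants, cut_score, critical_level):
    """Count each outcome over (test_score, task_score) pairs."""
    return Counter(
        classify(test, task, cut_score, critical_level) for test, task in applicants
    )


# Hypothetical (arm-curl 1RM, maximum box lift) pairs in kg, for illustration only.
applicants = [
    (20.0, 44.0), (22.0, 52.0), (23.0, 49.0),
    (24.0, 47.0), (26.0, 55.0), (30.0, 63.0),
]

counts = tally(applicants, cut_score=23.4, critical_level=50.0)
for outcome in ("true positive", "false positive", "false negative", "true negative"):
    print(outcome, counts[outcome])
```

Rerunning the tally with different cut-scores shows the trade-off directly: lowering the cut-score reduces false negatives at the cost of more false positives, and raising it does the reverse.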

Construct Validity: Construct validity is the most indirect and theory-driven method of
establishing validity. Construct validity exists when selection tests are related to a general
trait or set of characteristics (the construct) that is associated with successful accomplishment
of important or critical job behaviors. The establishment of construct validity requires that
employers show that a construct (a general trait or set of characteristics) is required for
satisfactory job performance, and that the selection test or tests measure this same construct.

Constructs are often developed using the statistical technique of factor analysis (Rummel,
1970). In factor analysis, a number of correlated variables are reduced to a smaller number of
dimensions or factors. Within the factor, each of the included variables has a coefficient or
“loading,” a numerical value indicating the strength of association of that variable with the
factor. The greater the loading, the greater the association between the variable and the factor.
The factor is defined mathematically as the sum of the factor variable values, each multiplied by
its loading. The variables with the greatest loadings drive the theoretical interpretation of the
factor.
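
The weighted-sum definition of a factor given above can be written out directly. The Python sketch below follows that simplified definition; the test names, loadings, and scores are hypothetical, and the use of standardized (z) scores as inputs is an assumption of this sketch. Statistical packages typically estimate factor scores with scoring coefficients derived from the loadings, so this should be read as an illustration of the idea rather than a full factor-scoring procedure.

```python
def factor_score(scores, loadings):
    """Weighted sum of a person's variable scores, each multiplied by its loading."""
    return sum(loadings[name] * scores[name] for name in loadings)


# Hypothetical loadings on a "strength" factor and one applicant's
# standardized (z) scores; none of these values come from a real study.
strength_loadings = {"grip": 0.82, "arm_lift": 0.88, "torso_pull": 0.79, "sit_ups": 0.21}
applicant_z = {"grip": 1.0, "arm_lift": 0.5, "torso_pull": 0.0, "sit_ups": -1.0}

print(round(factor_score(applicant_z, strength_loadings), 2))
```

The three strength tests carry large loadings and dominate the score, which is why a factor like this one would be interpreted as "strength"; the sit-up test, with its small loading, contributes little.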

Construct  validity  can  be  established  in  three  ways — 

1.  Performances  on  job  behaviors  can  be  analyzed  to  determine  dimensions  within  the  job. 
Scores  on  selection  tests  can  then  be  shown  to  be  correlated  with  the  job  dimensions. 

2. Scores on selection tests can be factor analyzed, and dimensions within the selection tests
identified. A number of examples of such analyses can be found in the literature
(Fleishman, 1964; Hogan, 1991a; Meyers, Gebhardt, Crump, & Fleishman, 1984). Factor scores
from the dimensions of the selection tests can then be shown to be correlated to
performance on important job behaviors.

3. Both potential selection test items and performance on important job behaviors can be
factor analyzed. A validity study can then be carried out to analyze the associations
between the selection factors and the job factors.

These  options  are  indicated  schematically  in  Figure  5.2. 

Figure 5.2 is an oversimplified version of the actual situation. Often, more than one construct
is present in the job behaviors. For example, strength and endurance may be required for job
success. In such a case, many more relationships must be worked out in the validity study.

The conduct of a study to demonstrate construct validity is similar to that for criterion-related
validity. The difference is that instead of a one-to-one mapping of performance on a selection test
to performance on a job behavior, several selection-test items are measured and used to calculate
factor scores that represent the selection constructs, and/or several job behaviors are measured to
calculate factor scores that represent the job constructs. It is these factor scores that are used
in the correlational analysis. Construct-validity relationships are often difficult to demonstrate
because of the need to identify the factor structures in the job and selection tests and then
establish associations between or among them. Given these difficulties, many employers choose to
use the measures of underlying constructs directly as elements of criterion-related validity
studies.

Figure 5.2 Three experimental designs for construct validity studies. Single-ended arrows indicate
variables included in the factor analysis. Double-ended arrows indicate correlations to be measured.

Requirements  for  Validity  Studies 

The  Uniform  Guidelines  provide  general  and  technical  standards  for  validity  studies.  Among 
the  general  standards  are  the  following — 

• In addition to specifying the three types of studies (content, criterion-related, and
construct-validity), the guidelines require the studies to be consistent with applicable
professional standards for such research, accurate, and free from bias.

• The validity studies should be documented.

• The employer must be prepared to justify the method used to implement the selection tests.
If use of a test has greater adverse impact when used as a ranking device than if it were
implemented as a simple pass/fail, then the employer must provide sufficient evidence of the
validity and utility to support use of the test to rank-order participants.

• Selection procedures may be developed for higher level jobs in cases where most of the
entry-level applicants will progress to those higher level jobs.

• An employer may continue to use selection procedures for which there is not yet full validity
evidence as long as the employer has evidence of the substantial validity of the procedures and
will conduct, when technically feasible, a study to produce the additional evidence required.

• Employers may also use validity studies conducted by others when it can be shown that the
validity studies were conducted properly and that the jobs perform substantially the same
major work behaviors for the employer as for those who conducted the study.

• Employers, labor organizations, and employment agencies are encouraged to work together
and cooperate in validity studies.

• Finally, under no circumstances will the general reputation of a test or other selection
procedures or casual reports of its validity be accepted in lieu of evidence of validity.


The minimum technical standard that the Guidelines set for all tests is that validity studies
should be based on a review of information about the job (a job analysis). The technical standards
differ somewhat by type of validation study. Tables 5.1, 5.2, and 5.3 summarize these standards
by validation method.

Table 5.1 EEOC Technical standards Guidelines for the criterion validation method


Technical  Standard  for  Criterion-Related  Validation  Studies 


1. The study must be technically feasible. It must be possible to get an adequate sample size to provide a scientifically sound result.
However, an employer is not required to hire or promote individuals in order to be able to conduct a criterion-related study.

2. Whether the study is to be concurrent or predictive, the sample subjects should be representative of the individuals who might reasonably be expected
to fill the positions being studied.

3. In general, the guidelines indicate the finding of a significance level P < 0.05 to be acceptable.

4. However, users should evaluate each selection procedure to assure that it is appropriate for operational use. In general, the greater the magnitude of the
correlations found between the job behaviors and the tests, and the greater the number of job behaviors predicted by a particular test, the more appropriate
it is for implementation. Selection procedures derived from studies with large sample sizes and low correlations, and sole reliance on a selection instrument
that is related to only one of many critical job behaviors, will be subject to close review.

5. Users must avoid use of techniques that can lead to inflated validities for selection procedures. Examples include reliance on a few selection
procedures or criteria when many were studied, and use of the statistics from one sample when they may not have held up well on cross-validation.
The Guidelines recommend large samples and use of cross-validation.

6. The Guidelines call for the maintenance of “fairness” in selection procedures. Essentially, unfairness results when members of one group characteristically
obtain lower scores on a selection procedure than members of another group, but the differences in scores on the selection instrument are not manifest in
differences in job performance. The guidelines call for investigation of the fairness of selection procedures whenever a selection device has adverse impact.


Table 5.2 EEOC Technical standards Guidelines for the content validation method


Technical  Standard  for  Content  Validation  Studies 


1. Consideration must be given to the appropriateness of content validity strategy. Such a strategy is not appropriate when the job tasks represent knowledge,
skills, and abilities that an employee is expected to learn on the job. It is also not appropriate for demonstrating the validity of selection procedures
that claim to measure traits or constructs such as intelligence, aptitude, personality, common sense, judgment, and leadership.

2. The job analysis must focus on the important work behaviors, their relative importance across all behaviors, and the products of such work behaviors.
To be included in a work sample, the behaviors must be observable, and some aspect of them must be measurable. The work behaviors selected
for measurement should be critical and/or important work behaviors that constitute most of the job.

3. To demonstrate content validity of a selection procedure, it must be shown that the behaviors are a representative sample of behaviors of the job
or that the selection procedure offers a representative sample of the work product of the job. For selection procedures measuring a skill or ability,
the procedures must closely approximate an observable work behavior or work product. The closer the content and the context of the selection tests
are to work samples and work behaviors, the more suitable they are for showing content validity.

4. Whenever feasible, measurement of the reliability of the selection procedures should be carried out.


Table  5.3  EEOC  Technical  standards  Guidelines  for  the  construct  validation  method 


Technical  Standard  for  Construct  Validity  Studies* 


1. The Guidelines recognize that establishment of construct validity is a more complex strategy than either content or criterion-related validity,
and that there was, at the time of the Guidelines' publication, a lack of literature extending the concept to employment practices.

2. Therefore, the job analysis must be carried out in a fashion that allows the identification of constructs underlying the important job behaviors.
Each construct discovered should be named and defined to distinguish it from all other constructs so discovered.

3. Selection procedures should then be developed or identified that measure the work behavior constructs. The users must then show that the selection
procedures are related to the work behavior constructs and that the work behavior constructs are validly related to the performance of important or
critical work behaviors.

4. The Guidelines allow limited use of construct validity studies. “Until such time as professional literature provides more guidance on the use of
construct validity in employment situations, the Federal agencies will accept a claim of construct validity without a criterion-related study . . .
only when the selection procedure has been used elsewhere in a situation in which a criterion-related study has been conducted and the use
of a criterion-related validity study in this context meets the standards for transportability of criterion-related validity studies set forth above. . . .”


* See Figure 5.2


Physiological Validation


The validation models identified in the EEOC Guidelines (EEOC, 1978) are based on the
American Psychological Association standards for validating educational and psychological tests
(A.P.A., 1985; A.P.A., 1987). A major difference when validating physical tests is the use of
physiological, not psychological, tasks. Physiological tests differ from educational and
psychological tests.

The goal of physiological validation is to match the worker with the physiological demands of
the job. An essential element of this process is the quantification of the task's physiological
stress. The recent court ruling of Lanning v. SEPTA (U.S. 3rd Circuit 1999) gives legal support to
physiological validation. The case is discussed in greater detail in Chapter 7 of this
State-of-the-Art Report (SOAR). A key issue in the Lanning v. SEPTA case was setting a valid
aerobic fitness cut-score. The recommended cut-score represented a VO2max of 42.5 ml/kg/min. The
court ruled the standard to be unacceptable because the test developers failed to identify the
minimum aerobic capacity demanded by the job.

The tradition and “standard practice” used to validate criterion-related physiological tests is to
use the metric of the dependent variable (i.e., the criterion test) as the basis for evaluating a
subject's work capacity, which is sampled with the predictor test. The metric of the criterion
variable has physiological significance. This physiological test validation methodology is clearly
illustrated with body composition and VO2max concurrent test validation research. To illustrate,
Brozek and Keys (1951) not only reported the concurrent validity coefficient between the predictor
test, skinfold fat, and the criterion variable, hydrostatically measured percent body fat, but also
published the first regression equation providing a valid model to interpret a subject's skinfold
fat measurement by the more meaningful metric of percent body fat. As another example, the maximum
treadmill test following a standard protocol is a method of measuring VO2max. The concurrent
validation studies for this test (Bruce, Kusumi, & Hosmer, 1973; Foster, Jackson, & Pollock, 1984;
Pollock et al., 1976) published regression equations to estimate VO2max (ml/kg/min) from treadmill
time. The metric used to interpret aerobic fitness is VO2max, not elapsed treadmill time. The next
section of this chapter examines differences in the validation of physiological and psychological
tests.

Differences  in  Physiological  and  Psychological  Test  Validation 

Although the psychologically based validation strategies outlined in the EEOC Guidelines are
suitable for validating physical tests, there are at least three important differences. These are
the test metric used, the work task definition, and the matching of the worker to the demands of
the task.

Test Metric: The first major difference between psychological and physiological tests is the
test's metric. Typically, the metric of physiological tests is a ratio measurement scale. In
contrast, scaling of psychological tests is either ordinal or interval. The units of measurement of
physiological tests include percent body fat, oxygen uptake, caloric expenditure, force exerted,
pounds lifted, weight load transported, and various types of power output, to name a few. The unit
of measurement has physiological significance. In contrast, the unit of measurement of
psychological tests is typically an individual's response on a knowledge test or response to some
type of scale (e.g., Likert scale). The unit of measurement on psychological tests is of little
importance. This is evidenced by the common practice of transforming scores on psychological tests
from the original metric into some form of standard score with a known mean and standard deviation,
such as 500 and 100. The person's score is interpreted relative to the mean and standard deviation
of the test. In contrast, a physiological test is not only interpreted with the mean and standard
deviation of a population, but the value can also have an important physiological meaning. For
example, a VO2max of 20 ml/kg/min not only signifies a person has low fitness by normative
standards but also indicates that the person lacks the physiological capacity to perform work tasks
with an energy cost that exceeds the person's low aerobic capacity.
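
The standard-score transformation mentioned above is a short calculation, sketched here in Python. The target mean of 500 and standard deviation of 100 come from the text; the raw-score means, standard deviations, and scores are hypothetical, as is the function name.

```python
def standard_score(raw, raw_mean, raw_sd, new_mean=500.0, new_sd=100.0):
    """Rescale a raw score to a scale with the given mean and standard deviation."""
    z = (raw - raw_mean) / raw_sd
    return new_mean + new_sd * z


# Two hypothetical tests with unrelated raw metrics, for illustration only.
knowledge_test = standard_score(raw=34.0, raw_mean=28.0, raw_sd=6.0)
rating_scale = standard_score(raw=4.2, raw_mean=3.5, raw_sd=0.7)

print(knowledge_test, rating_scale)
```

Both hypothetical scores sit one standard deviation above their test means, so they land at about the same place on the 500/100 scale even though the raw units have nothing in common. That is the sense in which the raw unit carries little information for a psychological test, whereas a value such as 20 ml/kg/min keeps its physiological meaning without any rescaling.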

Accurate Quantification of Work Demands: A characteristic of physiological test validation is
that the physical demands of work tasks can often be objectively measured. This is because of the
capacity to define the physical demands of the work task. Extensive physiological research has
defined the energy expenditure of a host of occupational, recreational, and fitness tasks by
measuring oxygen consumption while doing the tasks (Durnin & Passmore, 1967; Passmore & Durnin,
1955). These energy-cost tables are published in basic exercise physiology texts (Astrand & Rodahl,
1986; Brooks & Fahey, 1984; McArdle, Katch, & Katch, 1991; Wilmore & Costill, 1994). The
forces required to “crack” valves and push or pull objects can be measured with torque wrenches and
electronic load cells (Jackson, Osburn, Laughery, & Sekula, 1998; Jackson, Osburn, Laughery, &
Vaubel, 1992). The demands of materials-handling tasks can be defined by weight load, type of lift,
lift rate, and distance transported (Jackson, Osburn, Laughery, & Young, 1993a; Waters et al.,
1999; Waters, Putz-Anderson, Garg, & Fine, 1993). These objective data define the physiological
stress demanded by work tasks.

Match the Worker to the Physiological Demands of the Task — A final difference between
physiological and psychological test validation is the capacity to match the worker to the
physiological demands of the work task. Once the demands of the work task are known, the next step of
a physiologically based validation strategy is to determine if a worker has the capacity to meet the
demands of the task. This was the method used to define the minimum aerobic capacity (i.e., VO2max)
required for firefighting (Sothmann et al., 1990). This research showed that individuals with a
VO2max below 33.5 ml/kg/min were unable to meet the demands of firefighting. A goal of ergonomic
research has been to define the strength levels needed to do industrial tasks safely (Keyserling et
al., 1980; Keyserling, Herrin, & Chaffin, 1980). The next sections of this chapter discuss these
methods in more detail.


Physiological  Validation-Test  Fairness 

The goal of a physiological criterion-related strategy is not only to estimate the validity of the
test but also to determine the minimum physiological level required by the task. A second important
element  of  this  approach  is  the  physiological  interpretation  of  the  obtained  data  analyses. 
Interpretation  of  the  statistical  results  of  validation  research  with  relevant  physiological  theory  and 
published  research  provides  a  scientific  rationale  to  explain  the  results.  Failure  to  do  this  leaves  the 
validation  results  open  to  question. 

An important issue to resolve in a criterion-related study is whether the preemployment test is
fair. Unfairness is defined as a situation in which members of a protected group obtain lower scores
on a preemployment test than members of another group, but the difference in scores is not reflected
in differences in the criterion of job performance (EEOC, 1978). The evaluation is called the Cleary
test of fairness, and fairness is affirmed by showing that the regression line that defines the
relationship between the preemployment test and the criterion is common to both groups. The
statistical procedure is to test for homogeneity of regression slopes and intercepts (Arvey & Faley,
1988; Jackson, 1989; Pedhazur, 1997). The literature provides examples of the use of this test
(Arnold, Rauschenberger, Soubel, & Guion, 1982; Reilly, Zedeck, & Tenopyr, 1979).

Although the Cleary test may evaluate the fairness of an employment test, the analyses can also
provide a physiological interpretation of the employment test. The Cleary test is the method of
determining whether a common regression equation can be used to explain the relationship
between the predictor and criterion tests of two groups. In physical test validation, the two groups
are typically male and female applicants. The data analysis strategy is first to determine whether
the two groups share a common regression slope and then to decide whether the groups' regression
intercepts are within chance variation. Multiple regression is the statistical model used to test for
fairness. This analysis involves dummy-coding the group variable (e.g., female = 0, male = 1) and
forming a group-by-predictor-test interaction term (Pedhazur, 1997). The statistical strategy used is
to generate a full multiple regression consisting of three variables:

1.  a  predictor  test 

2.  a  dummy-coded  group  variable,  and 

3.  an  interaction  term,  which  is  the  product  of  the  group  and  test  variables. 

The next step is to generate two restricted regression models: the first with two independent
variables, the group variable and the predictor test; and the second with just the predictor
variable. Group differences in slopes and intercepts are evaluated by testing the change in R²
between the full and restricted models. Pedhazur (1997) outlines these statistical methods and
tests of significance. These methods are illustrated next with physiological data. Also shown are the
role and importance of the physiological interpretation of the results.
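The full-versus-restricted comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the cited studies: the function names (`solve`, `r_squared`, `f_change`, `cleary_test`) and the synthetic data are hypothetical, the fit is ordinary least squares through the normal equations, and each F-ratio tests a single dropped predictor.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on the given predictor columns."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]  # intercept term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
                 for r, yi in zip(rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def f_change(r2_full, r2_restricted, n, k_full):
    """F-ratio (1 and n - k_full - 1 df) for dropping one predictor from the fuller model."""
    return (r2_full - r2_restricted) / ((1.0 - r2_full) / (n - k_full - 1))

def cleary_test(test, group, criterion):
    """F-ratios for the slope step and the intercept step of the fairness analysis."""
    n = len(criterion)
    interaction = [t * g for t, g in zip(test, group)]
    r2_full = r_squared([test, group, interaction], criterion)  # test, group, test x group
    r2_main = r_squared([test, group], criterion)               # test, group
    r2_test = r_squared([test], criterion)                      # test only
    return {
        "F_slope": f_change(r2_full, r2_main, n, k_full=3),
        "F_intercept": f_change(r2_main, r2_test, n, k_full=2),
    }

# Synthetic illustration (assumed data): common slope, intercepts 6 units apart.
random.seed(0)
group = [0] * 100 + [1] * 100                     # dummy-coded group variable
test = [random.gauss(100, 20) for _ in group]
criterion = [5 + 0.2 * t + 6 * g + random.gauss(0, 1) for t, g in zip(test, group)]
result = cleary_test(test, group, criterion)
print(result)
```

With data built this way, the intercept F-ratio should be far larger than the slope F-ratio, the same pattern as parallel regression lines with different intercepts. In practice each F-ratio would be compared with the critical value for its degrees of freedom.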

Group Difference in Regression Slopes — A task analysis of freight mover tasks showed that
rapidly moving packages from a container to a conveyor belt was a physically demanding task (Jackson
et al., 1993a). A work-sample test was developed to duplicate the demands of this repetitive
transport task. The task involved moving packages that ranged in weight from about 15 to 80 pounds.
The distribution of package weights was representative of the weight distribution encountered by
workers. Exercise heart rate was measured to ensure the work rate of the simulation test was
representative of the actual work rate. The subjects were instructed to work at a brisk rate
consistent with their fitness and not to move packages that exceeded their capacity.

Figure 5.3 shows the bivariate relationship between the predictor test (sum of isometric strength)
and the criterion test (materials transport, expressed in a metric of power output, the pounds of
freight transported per minute). The data are contrasted by gender. Analysis of these data showed
that male and female regression lines were not parallel. The R² change between the full model and
the restricted model of the strength test and dummy-coded gender variable was 0.04, which was
statistically significant (F(1,199) = 18.96; p < 0.01). The graph shows that the slope for the female
subjects (0.534) is more than twice as steep as the slope for male subjects (0.208).

A strict interpretation of the Cleary test would indicate that the strength test was unfair, but a
physiological interpretation of the data gives a clearer view. Post hoc examination of the data
showed that many females could not lift and transport the heavier packages. The lift weight exceeded
their strength capacity. The steeper female slope showed that individual differences in strength
were more important for females than males. The stronger women could lift the heaviest weight
loads while the weaker women could not. A major determinant of the female capacity to move
freight was the subject's strength-dependent capacity to lift heavy loads. In contrast, most men had
the physiological capacity to lift and transport the heaviest loads. These physiological data would
be important information for setting a cut-score consistent with the demands of the task. The data
could also have important ergonomic implications that could lead to job redesign, such as a company
policy limiting the weight of packages it would transport.


Intercept Differences — The second part of the Cleary test is to evaluate differences in regression
intercepts. Figure 5.4 shows a physiological example of intercept differences in the form of the
scatterplot of published male and female body composition data (Jackson & Pollock, 1978; Jackson,
Pollock, & Ward, 1980). The independent variable is the sum of seven skinfold measurements, and
the dependent variable is percent body fat measured by the underwater weighing method. The figure
shows that the slopes of the male and female regression lines are parallel; the differences in slope
are within random variation (F(1,675) = 1.25; p > 0.05). The R² difference between the full model and
the restricted model with gender and the sum of skinfolds was 0.0004. Adding the dummy-coded gender
variable to the sum of skinfolds accounted for more than 12 percent of the variance in percent fat
(F(1,675) = 398.75; p < 0.01). As these data show, the significant intercept difference indicates that
for a given score on the predictor test (sum of skinfold fat), the criterion score of one group,
which in this instance is measured percent body fat, can be expected to be systematically higher. The
regression lines differed by an average percent body fat of about 6 percent.

Figure 5.4 Test for fairness, example of parallel regression slopes, but significant differences in male and
female regression intercepts

A "blind" application of the Cleary test would indicate that the test was unfair. A physiological
interpretation of these results provides a clear rationale for the intercept difference. Skinfold fat
measures subcutaneous fat, but the body has two types of fat, subcutaneous and essential fat.
Hydrostatically determined percent body fat measures both sources of body fat. It is well
established that the essential fat of women is greater by about 7 percent of body mass than that of
men (McArdle et al., 1996). The gender difference in intercepts can be explained by this difference
in essential fat.

Although this body composition example does not represent a work-sample test, the use of
body composition tests has been an interest of military researchers (Marriott, 1992). It is well
documented that percent fat is inversely related to performance on strenuous tasks that involve
moving the body. This body composition example shows that if percent body fat is used to evaluate
male and female performance on common physical tasks (e.g., running, climbing), the test must be
expressed in the physiological metric of percent body fat, not the sum of skinfold fat. In contrast,
if the goal is to evaluate fitness rather than the capacity to meet the demands of a work task,
gender-based standards are appropriate (Gettman, 1993).

Common Slope and Intercept — The example provided in this section illustrates the homogeneity
of male and female regression lines for the predictor and criterion tests. Figure 5.5 gives the
scatter plot of the male and female relationship between isometric strength and peak push force. A
task analysis showed that pushing was a physically demanding task required of workers who moved
freight containers (Jackson et al., 1993a). The mean push force of the males was 124.6 (SD = 42.2)
compared with a mean of 70.0 (SD = 28.3) for the females. This difference was statistically
significant (F(1,205) = 99.89; p < 0.01). The figure shows that the male and female regression lines
are similar. Statistical analysis showed that neither the slopes (F(1,205) = 1.50; p > 0.05) nor the
intercepts (F(1,205) = 2.00; p > 0.05) of the male and female regression lines differed
significantly. The group and group-by-strength variables accounted for less than 0.1 percent of the
push-force variance. This demonstrated that differences in the regression lines shown in the figure
were random variance. This analysis demonstrated that a single regression line can be used to
estimate push force from isometric strength, and documented that the gender mean difference in work
task performance depended on strength, not gender.

Figure 5.5 Test for fairness, example of homogeneity of male and female regression slopes and intercepts


Physiological  Validation— Cut-Score 

Once the predictor test has been shown to be valid, the next step of a physiological validation
strategy is to define performance on the predictor test associated with the desired level of
performance on the criterion. An important and often difficult part of this analysis is defining the
critical level of performance on the criterion variable. In some instances, a clear definition of an
essential task is apparent, for example, lifting a 75-pound industrial valve from the ground to the
back of a truck. In other instances, the physiological demands of a task can be difficult to quantify
accurately. Shoveling coal is a physically demanding task of coal miners (Jackson & Osburn, 1983),
but what level of intensity and duration of shoveling are suitable? Firefighter work simulation tests
are timed tests that involve completing several firefighter tasks. Although a firefighter test may be
clearly content valid, a more difficult phase of the validation process is to determine the time that
signifies successful firefighting capacity (Jeanneret & Associates, 1999).

Regression models provide valid statistical methods of estimating physiological capacity, within
a defined degree of accuracy, from a predictor test or combination of tests. Simple linear and
nonlinear regression models are used with a single predictor test, and multiple regression models
are used with several predictor tests (Pedhazur, 1997). This is a well-established physiological test
validation method (ACSM, 1991; Astrand & Ryhming, 1954; Brozek & Keys, 1951; Bruce et al.,
1973; Durnin & Womersley, 1974; Foster et al., 1984; Jackson, 1990; Jackson & Pollock, 1978;
Jackson et al., 1980; Pollock et al., 1976). The following provides regression examples of defining
physiologically based standards with continuously scaled and pass/fail criterion variables.

Continuously Scaled Criterion — This first example shows the use of simple linear regression to
define the strength needed to generate the push force required by a task. The job analysis (Jackson
et al., 1993a) showed that one physically demanding job of freight workers was pushing or pulling
containers loaded with freight. As part of the job analysis, an electronic load cell defined the peak
force required to move freight containers that varied in weight. The subject's peak push force was
measured with an isometric push test that simulated the position used to push containers. Figure
5.5 shows the scattergrams with the male and female regression lines. As shown earlier, the
differences between the slopes and intercepts of the male and female regression lines were within
chance variation, which supports the fairness of using a single regression line to define this
relationship. The regression equation is:

Push Force Regression Equation (R = 0.78, SEE = 29.0 lbs)

$$\text{Push Force (lbs)} = 2.031 + (0.198 \times \text{Strength}) \tag{1}$$

The regression equation provides a valid model for defining the strength needed to generate the
push force needed to move containers of the criterion weight. Once the criterion push force is known,
the strength associated with it can be determined. To illustrate, assume the criterion push force was
defined to be 100 pounds of force. The regression equation shows that a strength score of 495
estimates a push force of 100 pounds.
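The arithmetic behind the 495 figure can be checked by solving the push-force regression equation for strength. The sketch below is only an illustration: the constant and function names are hypothetical, and the coefficients (intercept 2.031, slope 0.198) are taken from the push-force equation reported in this section.

```python
INTERCEPT = 2.031  # lbs of push force at zero strength (reported intercept)
SLOPE = 0.198      # lbs of push force per unit of summed isometric strength

def predicted_push_force(strength):
    """Estimated peak push force (lbs) for a summed isometric strength score."""
    return INTERCEPT + SLOPE * strength

def strength_for_push_force(criterion_force):
    """Strength score whose estimated push force equals the criterion force."""
    return (criterion_force - INTERCEPT) / SLOPE

cut = strength_for_push_force(100.0)
print(round(cut, 1))                          # 494.8, i.e. about 495
print(round(predicted_push_force(495.0), 3))  # 100.041
```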

The goal of a physiological model of validation is to define the minimum physiological capacity
demanded by the work task. The regression model provides empirical evidence to define a
physiologically based cut-score within a defined level of probability. Although physiological test
scores typically yield higher criterion-related validity coefficients than psychological tests, they
still have substantial prediction errors. Figure 5.6 shows the prediction errors associated with the
push force task. Provided is an Altman-Bland plot (Altman & Bland, 1983; Altman & Bland, 1986) of the
push force data estimated from isometric strength (see Figure 5.5). The Altman-Bland method plots the
residual scores (Y - Y', which is measured minus estimated push force) against the average of
measured and estimated push force. Although the correlation between the criterion, push force, and
the predictor, isometric strength, was high, 0.78, the Altman-Bland plot shows that defining
the physiological criterion is not error free. The variability on the Y axis is defined by the
standard error of estimate of the regression analysis, which, in this example, is 29 pounds of push
force.

Figure 5.6 Reprinted, by permission, from Altman, D. G., & Bland, J. M. (Altman-Bland plot of prediction
residuals (measured - estimated) contrasted by the average of measured and estimated maximum push
force), pp. 307-310, © by the Lancet Ltd., 1988.
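As a rough sketch of how the coordinates of such a plot are formed, the following Python computes, for each subject, the average of measured and estimated push force (X axis) and the residual, measured minus estimated (Y axis). The three data points and the function names are hypothetical; the estimates use the push-force regression coefficients reported earlier (2.031 and 0.198).

```python
def altman_bland_points(measured, estimated):
    """Pair each subject's average of the two values with the residual (measured - estimated)."""
    return [((m + e) / 2.0, m - e) for m, e in zip(measured, estimated)]

def residual_summary(points):
    """Mean residual (bias) and the standard deviation of the residuals."""
    residuals = [r for _, r in points]
    n = len(residuals)
    mean = sum(residuals) / n
    sd = (sum((r - mean) ** 2 for r in residuals) / (n - 1)) ** 0.5
    return mean, sd

# Hypothetical subjects: summed isometric strength and measured peak push force (lbs)
strength = [300.0, 450.0, 650.0]
measured = [60.0, 95.0, 130.0]
estimated = [2.031 + 0.198 * s for s in strength]  # reported regression coefficients

points = altman_bland_points(measured, estimated)
bias, spread = residual_summary(points)
for average, residual in points:
    print(round(average, 2), round(residual, 2))
```

A residual spread that stays roughly constant across the X axis is what allows a single standard error of estimate to describe the prediction error at every strength level.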

Because the correlation between a predictor variable and the criterion test is always less than 1,
there will always be prediction errors. The standard error of estimate provides an estimate of the
variation in prediction error. Although it is not possible to define an exact physiologically based
cut-score, it is possible to define a standard with a defined degree of probability. The regression
equation (Equation 1) provides a valid model that defines the relationship of strength with push
force. As shown earlier, 495 pounds of strength is associated with a push force of 100 pounds.
Because the correlation between the two tests is less than perfect and there are prediction errors,
only 50 percent of subjects with 495 pounds of strength would be expected to have the capacity to
generate 100 pounds of push force. The regression model's standard error of estimate can be used to
define the probability that someone with a given level of strength would meet the physiologically
based standard. Figure 5.7 shows the relationship between level of isometric strength and the
probability of being able to generate 100 pounds of push force. The probability estimates provide
additional data that can be used to define a physiological criterion that is congruent with the
criticality of the task, and the mission and unique organizational characteristics.
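One plausible way to turn the standard error of estimate into such probabilities is sketched below. It assumes that prediction errors are normally distributed with a constant standard deviation equal to the SEE (29.0 lbs); the source does not state how its probability curve was computed, so this is an illustration rather than the authors' calculation. The function name is hypothetical, and the coefficients are those reported for the push-force regression.

```python
import math

INTERCEPT, SLOPE, SEE = 2.031, 0.198, 29.0  # reported regression values (lbs)

def prob_meets_standard(strength, criterion_force=100.0):
    """Probability that actual push force is at least the criterion, given strength.

    Assumes normally distributed prediction errors with standard deviation SEE.
    """
    predicted = INTERCEPT + SLOPE * strength
    z = (predicted - criterion_force) / SEE
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z

for s in (350, 495, 650):
    print(s, round(prob_meets_standard(s), 2))
```

Under these assumptions a strength score of 495 gives a probability of about 0.50, and the probability rises or falls as strength moves above or below that score. A more rigorous standard would correspond to choosing the strength score at which the probability reaches a higher value.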

Figure 5.7 Probability of being able to generate 100 pounds of push force for levels of strength

Pass-Fail Model — Often, the criterion of job performance is scaled as a dichotomous variable. For
example, manual lifting tasks are scored pass or fail: the applicant could or could not lift a given
weight load (Jackson, Osburn, Laughery, & Sekula, 1998; Jackson et al., 1992). Other examples are
endurance tasks at a constant power output. A manufacturing work task may require a worker to
repetitively lift and transport weight loads at a given work rate governed by production speed.
Individuals without sufficient physiological capacity would not be able to maintain the set pace. A
task analysis documented that refinery workers must close industrial valves during emergencies
(Jackson, 1987; Jackson et al., 1992; Osburn, 1977). For some individuals, the task exceeded their
physiological capacity and they fatigued quickly. For others, the task was within their physiological
capacity. These fit individuals could continue work for extended periods of time. Demanding
repetitive tasks at a set power output tend to produce a bimodal distribution: those who have and
those who do not have the physiological capacity. This is illustrated in the literature (Jackson et
al., 1992).

Logistic regression analysis (Hosmer & Lemeshow, 1989; Pedhazur, 1997) provides a model
to physiologically validate tests when the criterion is a dichotomous variable. Logistic regression,
like multiple regression, can use a single independent variable or several independent variables. A
logistic regression model estimates the probability of group membership (e.g., criterion variable of
pass or fail) given a score or scores on the predictor variable (Pedhazur, 1997). A landmark public
health application of multiple logistic regression was the Framingham heart study (Kannel, McGee, &
Gordon, 1976). The research objective was to identify and quantify cardiovascular disease risk
factors. The logistic analysis not only established that cholesterol, blood pressure, glucose
intolerance, and smoking were independent cardiovascular disease (CVD) risk factors but also
produced an equation for estimating the probability of CVD for combinations of risk factors.
Logistic regression analysis, like regression models with continuous variables, establishes the
validity of the independent variable(s) and provides an empirical model for defining the probability
of group membership. The application of simple logistic regression analysis is illustrated below
with a lifting task.

Human Systems IAC SOAR, 2000

A task analysis of an oil production plant showed that lifting heavy valves from the floor to
knuckle height was an important, physically demanding work task (Jackson, 1998). A work-sample
test was developed to simulate the task. The work-sample test involved lifting several loads that
varied in weight. The physical dimensions of the lift duplicated the work task. The test was scored
pass or fail depending on the subject's ability to complete the lift. The predictor test was the sum
of four isometric strength tests: arm, shoulder, torso, and leg strength. The goal of this
physiological validation was to define the level of strength required for the lift task.

This validation method is illustrated with three weight loads: 60-, 90-, and 120-pound lifts.
These weights represent industrial lifts ranging from moderately heavy to very difficult. The first
step in this analysis was to determine whether lift success depended on strength. Table 5.4 provides
the means, standard deviations, and sample sizes of the subjects who passed and failed the lift.
The analysis documented three expected trends. First, the number of individuals who could lift the
load decreased as the weight load increased. Next, the analysis of variance (ANOVA) documented that
lift success for all three weights depended on isometric strength. The means for those who lifted the
weight were significantly higher than for those who could not. Third, the mean strength of those who
completed the lift increased with the weight load. These trends are consistent with physiological
expectations.


Table  5.4  Sample  sizes,  strength  means  and  standard  deviations,  and  analysis  of  strength  differences  of 
those  who  could  and  could  not  lift  the  weight 


Lift Weight     Lifted Weight          Did Not Lift Weight     ANOVA
                N       M ± SD         N       M ± SD          F-ratio

60-Pound        120     518 ± 197      16      196 ± 66        41.92*
90-Pound        93      579 ± 175      43      233 ± 101       118.89*
120-Pound       71      644 ± 141      65      301 ± 108       250.69*

*p < 0.0001


Figure 5.8 provides a scatter plot of the subjects' strength data contrasted with their 90-pound
lift success. This plot shows the group difference in strength documented by the ANOVA but also
shows an overlap in the strength of those who passed and failed the lift. Logistic regression
analysis provides a model for estimating the probability of success on the criterion variable (i.e.,
lifting the load) for given levels on the predictor test (i.e., strength) or, in this example, the
probability of being able to lift the load for a level of strength. The logistic regression analysis,
which agreed with the ANOVAs (Table 5.4), showed that the regression weight for strength was
significantly related to the probability of lifting the given weight. The equations for the three
lift loads are:

60-pound lift

$$\text{Logit}(P) = (0.020 \times \text{Strength}) - 3.926 \tag{2}$$

90-pound lift

$$\text{Logit}(P) = (0.017 \times \text{Strength}) - 5.689 \tag{3}$$

120-pound lift

$$\text{Logit}(P) = (0.023 \times \text{Strength}) - 10.334 \tag{4}$$

Figure  5.8  Scatterplot  of  strength  test  of  subjects  who  could  or  could  not  complete  a  90-pound  lift  from  floor 
to  knuckle  height 


Once the logistic equation is defined, Equation 5 estimates the probability of success (Pedhazur,
1997). The term e in Equation 5 is the base of the natural logarithm, a value of approximately 2.718.
Figure 5.9 graphically shows the probability of success in completing the lift for strength levels.

Logistic Probability Calculation Model

$$P = \left(\frac{e^{\text{Logit}(P)}}{1 + e^{\text{Logit}(P)}}\right) \times 100 \tag{5}$$
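Applying the logistic probability model to the three lift equations can be sketched as follows. The coefficients are the ones reported for the 60-, 90-, and 120-pound lifts (0.020 and 3.926, 0.017 and 5.689, 0.023 and 10.334); the dictionary and function names are hypothetical, and the probability is computed in the standard logistic form, e^Logit / (1 + e^Logit), expressed as a percentage.

```python
import math

# load (lbs) -> (slope, intercept) of the reported logit equations
LIFT_MODELS = {60: (0.020, -3.926), 90: (0.017, -5.689), 120: (0.023, -10.334)}

def lift_probability(load, strength):
    """Estimated percent probability of completing the lift for a strength score."""
    slope, intercept = LIFT_MODELS[load]
    logit = slope * strength + intercept
    return 100.0 * math.exp(logit) / (1.0 + math.exp(logit))

def strength_at_even_odds(load):
    """Strength score at which the estimated probability is 50 percent (logit = 0)."""
    slope, intercept = LIFT_MODELS[load]
    return -intercept / slope

for load in LIFT_MODELS:
    print(load, round(lift_probability(load, 200.0), 1), round(strength_at_even_odds(load)))
```

For a strength score of 200, these equations give roughly 52 percent for the 60-pound lift, 9 percent for the 90-pound lift, and well under 1 percent for the 120-pound lift. The 50 percent points fall near 196, 335, and 449 pounds of strength.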

The logistic probability curves clearly show, as would be physiologically expected, that the
strength needed to lift the load increases as the lift gets heavier. There is a 50 percent
probability, for example, that someone with 200 pounds of strength could lift a 60-pound load. In
contrast, only 10 percent of the subjects with 200 pounds of strength would be expected to lift 90
pounds. The likelihood of someone with 200 pounds of strength lifting 120 pounds is essentially 0.
The physiological levels needed to be 50 percent confident of lifting the 90- and 120-pound loads are
about 350 and 450 pounds of strength, respectively.

Figure 5.9 Logistic curves of the probability of being able to lift the weight load as a function of lifter strength

Physiological  Validation— Matching  the  Worker  to  the  Job 

The goal of physiological test validation is to select workers with the capacity to meet the
demands of the job. This is consistent with ergonomic objectives designed to reduce the risk of
job-related injuries (Ayoub, 1982). As has been shown in this chapter, the statistical models used to
define the physiological stress of the task are less than absolute. This permits latitude in
formulating physical cut-scores ranging from lenient to rigorous. The regression statistics,
equations, and standard errors provide an empirical base for making the decision.

Although the regression models previously discussed can help define the degree of physiological
stress, the difficult task of establishing a suitable cut-score for a criterion remains. The types of
job performance criteria listed in the Uniform Guidelines that may be suitable are supervisory
ratings, production rate, error rate, tardiness, absenteeism, and success in training. According to
the Guidelines, this is not an inclusive list of criteria. Other examples of criteria used to
validate physical tests include accidents (Reilly et al., 1979); field performance (Reilly et al.,
1979); injury rates (Gilliam & Lund, 2000; Keyserling et al., 1980; Keyserling et al., 1980); lost
time due to sickness or injury (Rayson et al., 2000a; Rayson et al., 2000b); and job-related work
tasks (Arnold et al., 1982; Jackson, Osburn, & Laughery, 1998; Jackson, Osburn, & Laughery, 1984;
Jackson et al., 1992; Jackson, Osburn, & Laughery, 1991; Jackson, Zhang, Laughery, Osburn, & Young,
1993b; Rayson, 2000a; Rayson, 2000b).

A crucial element of any evaluation strategy is the selection rate of a protected group, which, in
physical testing, is females. The physiological validation method supplements the process of defining
an appropriate cut-score with scientific evidence. This validation approach seeks to find the
minimum physiological level demanded by the task. The Uniform Guidelines (EEOC, 1978) allow
the use of rational judgment in setting a valid cut-score. An objective of the physiological
validation process is to provide a scientific explanation of the validation results. Included in this
process is the establishment of a sound cut-score. Receiver operator characteristic (ROC) analysis
(Hulley, 1988) is one method used to establish physiological cut-scores. It supplements the
regression results by defining a cut-score consistent with a strategy of maximizing either test
sensitivity or specificity.

An ROC curve is a graphic analysis used to establish a trade-off between test sensitivity and
specificity. If the goal is to maximize test sensitivity, that is, the proportion of true positives
(i.e., those who can meet the physiological demands of the work), the ROC curve would be a plot
of test sensitivity by 1 − specificity, which is the proportion of false positives. False positives are
those identified by the test as having the physiological capacity to meet the demands of the task but
who cannot meet the demands. In this context, the ROC curve provides a rational method of
selecting a cut-score based on a balance between high sensitivity and a low false-positive rate. The
interested reader is directed to another source (Wellens et al., 1996) for the application of ROC
analysis for establishing a physiological cut-score. The objective of that study was to find the body
mass index (weight divided by height squared) that defined the obesity levels of 25 percent and
33 percent body fat content, determined hydrostatically, for men and women, respectively.
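
The ROC logic described above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the strength scores and job outcomes are made up, and the
rule used to pick the cut-score (maximizing sensitivity minus the false-positive rate, often
called Youden's index) is one common choice, not a rule prescribed by the studies cited here.

```python
def roc_points(scores, can_do_job):
    """Return (cut, sensitivity, false_positive_rate) for each candidate cut-score.

    A person passes the test when score >= cut. can_do_job[i] is True when
    person i can actually meet the physiological demands of the work.
    """
    positives = sum(can_do_job)
    negatives = len(can_do_job) - positives
    points = []
    for cut in sorted(set(scores)):
        true_pos = sum(1 for s, ok in zip(scores, can_do_job) if s >= cut and ok)
        false_pos = sum(1 for s, ok in zip(scores, can_do_job) if s >= cut and not ok)
        points.append((cut, true_pos / positives, false_pos / negatives))
    return points


def best_cut_by_youden(points):
    """Pick the cut that maximizes sensitivity minus the false-positive rate."""
    return max(points, key=lambda p: p[1] - p[2])


# Hypothetical data: a strength score and whether the worker met the task demand.
scores = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
can_do_job = [False, False, False, True, False, True, True, True, True, True]

points = roc_points(scores, can_do_job)
cut, sensitivity, fpr = best_cut_by_youden(points)
print(f"cut-score {cut}: sensitivity {sensitivity:.2f}, false-positive rate {fpr:.2f}")
```

A lower cut-score on the same curve would raise sensitivity at the price of more false positives,
which is the trade-off the factors listed below help to resolve.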

Several factors are considered when establishing physiologically based cut-scores. The following is a
nonexhaustive list of conditions that may determine whether a lenient or rigorous cut-score is selected:

• Adverse Impact: The first concern is adverse impact. Consideration must be given to the
number of protected group members that the standard screens out.

• Risk of Injury: Subjecting workers to physical demands increases the risk for work-related
injuries. Numerous studies (Cady, Bischoff, O’Connell, Thomas, & Allan, 1979; Gilliam &
Lund, 2000; Herrin, 1986; Keyserling et al., 1980; Liles et al., 1984; Snook, Campanelli, &
Hart, 1978; Snook & Ciriello, 1991) show that the risk of musculoskeletal injury increases
as the demands of the task approach the worker’s maximum physiological capacity.

• Physiological Interpretation of the Validation Results: An important element of a physical
test validation study is to establish the congruence among the validation results, published
research, and physiological theory. It is critical to provide a sound physiological explanation
of the validation results. Failure to interpret the results by accepted academic standards
leaves the decision open to question.

• Environmental Conditions: Often, the location at which the validation study is conducted
will be different from the work environment. For example, firefighter tests are not
administered in burning buildings, the source of demanding work. Environmental conditions
(e.g., heat) that increase the demands of the task justify more rigorous standards.

• Workforce Numbers: The number of workers available at the work site can affect the rigor
of a cut-score. A more lenient standard might be considered when several workers are
available to do the work. Although a lenient selection standard would increase the probability
that a worker cannot meet the most physically demanding parts of the job (i.e., a false positive),
it may not be a serious problem if others are available to do the work. The stronger workers can
help with the most demanding tasks. In contrast, a more rigorous standard might be
considered if a worker does not have help.

• Criticality of the Job: In some jobs, the failure to meet the demands of a job can be
dangerous. The dummy drag test is a common item of a preemployment firefighter test. This is
a critical task because the inability to perform it successfully can be life threatening.

• Workforce Productivity: Selecting workers with a higher physiological capacity can
increase an organization’s productivity. The data in Figure 5.3 show that the amount of
freight a worker was capable of moving was related to the worker’s strength capacity. This was
one of the factors considered by a freight company to initiate a preemployment test program.
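
The adverse impact concern that heads this list can be checked numerically for any candidate
cut-score. The sketch below applies the four-fifths (80 percent) rule of the Uniform Guidelines to
two made-up score distributions; the group labels, scores, and candidate cut-scores are
hypothetical and serve only to show how the impact ratio changes as the standard becomes more
rigorous.

```python
def selection_rate(scores, cut):
    """Proportion of applicants whose score meets or exceeds the cut-score."""
    return sum(1 for s in scores if s >= cut) / len(scores)


def impact_ratio(group_scores, cut):
    """Lowest group selection rate divided by the highest at a given cut-score."""
    rates = [selection_rate(s, cut) for s in group_scores.values()]
    highest = max(rates)
    return min(rates) / highest if highest > 0 else float("nan")


# Hypothetical strength scores for two applicant groups.
group_scores = {
    "men": [60, 70, 75, 80, 85, 90, 95, 100, 105, 110],
    "women": [45, 50, 55, 60, 65, 70, 75, 80, 85, 90],
}

for cut in (50, 60, 70):
    ratio = impact_ratio(group_scores, cut)
    flag = "adverse impact" if ratio < 0.8 else "within the four-fifths rule"
    print(f"cut-score {cut}: impact ratio {ratio:.2f} ({flag})")
```

A finding of adverse impact does not by itself rule out a cut-score; it signals that the standard
must be supported by validation evidence of the kind discussed in this chapter.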


Published  Validation  Studies 

Although many preemployment test validation studies have been completed, most are not in the
published literature. The completed validation study often is a technical report to the governmental
agency or private company that funded the project, and many organizations consider these
privileged. Hogan (1991b) provides an extensive list of these unpublished reports. The following
sections summarize the published validation research.


Outside  Craft  Jobs 

One of the first published concurrent validation studies was for outdoor telephone craft jobs
that involved pole-climbing tasks (Bernauer & Bonanno, 1975; Reilly et al., 1979). The issues
leading to the development of this study were the large differences between male and female workers
in turnover and accident rates. After 6 months, 43 percent of the women left the outdoor craft jobs
compared with only 8 percent of the males. More important, women sustained substantially more
injuries than men from falls while climbing or working on poles.

An extensive job analysis showed that pole climbing was an essential, physically demanding
work task. Bernauer and Bonanno (1975) evaluated the factor composition of 40 tests and
anthropometric measures on a sample of 241 job applicants. They developed a six-item battery
consisting of reaction time, grip strength, percent body fat, step test performance, balance, and
sit-ups. They found that the balance and step tests significantly differentiated successful from
unsuccessful students enrolled in pole-climbing school.

Reilly and associates (Reilly et al., 1979) extended this work by completing two concurrent
validation studies. In the first experiment, several anthropometric and physical performance tests were
administered to 83 male and 45 female candidates for outdoor telephone craft jobs. Two validation
criteria were used in this experiment. The first, general task performance, was the average of two
supervisor ratings of the candidate’s performance during the 5-day pole-climbing
school. Job analysis data were used to construct the rating scale. The second criterion was a
dichotomy of those who were on the job 6 months after placement and those who were not. Using
the criterion of general task performance, stepwise multiple regression isolated a three-predictor
battery consisting of dynamic arm strength, reaction time, and Harvard bench step time. The
analysis yielded a multiple correlation of 0.45. The statistically significant zero-order correlations
between the job tenure criterion and these tests were dynamic arm strength, 0.36; reaction time,
0.19; and bench step time, 0.18. Further analysis showed that a common regression line defined
male and female performance, meeting the important criterion of job fairness.

The second experiment used a larger sample of employees who represented the whole company.
The criterion of pole-climbing training success was changed to be consistent with changes
introduced in the pole-climbing course. The second study included four different criterion
measures of job performance:

1.  time  to  complete  the  pole-climbing  school, 

2.  completion  of  pole-climbing  school  (a  number  withdrew  from  the  course), 

3.  field  observations  of  pole-climbing  proficiency,  and 

4.  accidents  for  6  months  after  entering  outdoor  craft  work. 

The  second  sample  consisted  of  78  female  and  132  male  pole-climbing  school  applicants. 

Multiple regression selected a three-item battery consisting of body density estimated from
skinfold fat, balance, and an isometric arm strength test. The criterion was time to complete the
course. The significant correlations between the three tests and the four criteria were time to
complete the course, 0.46; training dropout, 0.38; field observations for the female sample, 0.53; and
accidents, 0.15. Further analysis showed that the same regression equation was equally valid for
both males and females.

Firefighters 

Nearly all major fire departments have a physical ability preemployment test (Tandy &
Investigator, 1992). Considine and associates (Considine et al., 1976) published the first physical
test battery for screening firefighter applicants. The test battery evolved from an occupational task
analysis that surveyed, rated, and analyzed 81 tasks performed by firefighters. The authors
selected a construct validation strategy. The constructs identified through the task analysis were
dynamic strength, static strength, agility, total body coordination, cardiorespiratory endurance,
muscular endurance, eye-hand coordination, and total body speed.

The sample of the first study consisted of 191 males who were tested on body composition
measures, general physical performance tests, and eight job sample tests. A factor analysis of these
data produced three general factors. The factor names and tests representing each factor were
factor 1, the ability to handle the body weight measured by percent body fat, obstacle run, and
flexed-arm hang; factor 2, muscle power measured by the hose lift, man-lift-and-carry, and stair climb
work sample tests; and factor 3, body structure measured by fat-free weight and height.

A major purpose of the second study was to analyze the test battery for racial bias. Based on the
results of the first study, nine tests were administered to 165 firefighters and 19 candidates. Data
analysis showed that African-American and white subjects did not differ on any of the tests. These
data were factor analyzed, producing three common factors. The final recommended battery
consisted of four work sample tests and one fitness test, the flexed-arm hang. The work sample tests
were modified man-lift-and-carry that simulated rescuing a trapped victim; stair climb that
simulated climbing the stairs in a building; obstacle run that simulated moving the body through
confined spaces; and hose couple that involved coupling three hoses to a hose couple.

Davis and associates (Davis, Dotson, & SantaMaria, 1982) examined the relationship between
simulated firefighting tasks and physical performance measures. The sample consisted of 100
randomly selected men from the population of Washington, DC, firefighters. The physical performance
measures included body composition, general fitness, aerobic fitness, and cardiovascular variables. The
five work-sample tests came from the job analysis of firefighter work tasks and involved handling a
ladder, lifting and transporting a 33.1-kilogram load up five flights of stairs, pulling a 23.5-kilogram
hose roll from the ground up to and through the fifth-floor window, carrying and dragging a
53-kilogram dummy down five flights of stairs, and using a sledge hammer to simulate forceful entry.

Canonical correlation showed that two independent dimensions defined the relationship
between the physical performance variables and firefighter work-sample tests. The first canonical
dimension (Rc = 0.79) represented a physical work capacity factor that reflected the muscular
strength and endurance, and maximal aerobic capacity elements of the simulated work-sample
tests. The second dimension (Rc = 0.63) represented a resistance to fatigue factor and the ability to
complete the work tasks quickly. Multiple regression selected two physical performance batteries
(laboratory and field batteries) to estimate each work-sample dimension. The field test battery for
the physical work capacity factor consisted of push-ups, sit-ups, and grip strength. The validity of
the field battery (R = 0.73) was lower than that of the five-item laboratory battery (R = 0.95), which
added submaximal oxygen pulse and maximum heart rate to the battery. The three-item field test of the
second factor included estimated percent body fat, lean body weight, and VO2max estimated with
a step test (R = 0.77). The laboratory test added maximum heart rate and treadmill performance
and increased the validity (R = 0.89) for the resistance to fatigue work-sample factor.

The physiological response to firefighting has been the focus of many investigators. Exercise
heart rate responses elicited by simulated and actual firefighting tasks confirmed that these tasks
have a significant cardiovascular effect (Barnard & Duncan, 1975; Davis & Convertino, 1975;
Lemon & Hermiston, 1977; Manning & Griggs, 1983; O’Connell, Thomas, Caddy, & Karwasky,
1986; Sothmann, Saupe, Jasenor, & Blaney, 1992). In a study during actual fire-suppression
emergencies, Sothmann and associates (Sothmann et al., 1992) measured exercise heart rate and oxygen
uptake on 10 male firefighters. Their data showed that firefighters worked at an average of 88
percent (± 6%) of their measured maximum heart rate for an average duration of 15 (± 7) minutes. The
average energy cost of the firefighter emergency work task was a VO2 of 25.6 ± 8.7 ml/kg/min,
representing an intensity of 63 percent (± 14%) of VO2max.

Sothmann and associates (Sothmann et al., 1990) examined the relationship between VO2max
and firefighting work tasks. A seven-item, content-valid fire suppression test was administered to
20 experienced firefighters. The average energy cost of the firefighter simulation tests was 30.5
(± 5.6) ml/kg/min. The work simulation required the firefighters to work at an intensity of 76 percent
(± 8) of VO2max. The correlation between the elapsed time required to complete the firefighter
work simulation test and measured VO2max was -0.55. In a cross-validation study with 32
different male firefighters, successful work simulation performance depended on VO2max. Of the 32
tested, seven firefighters could not complete the work sample tests. The VO2max of five of the
seven was below 33.5 ml/kg/min.

Highway  Patrol  Officers 

With an increasing number of women seeking employment as highway patrol officers, the
objective of the study published by Wilmore and Davis (1979) was to find the minimum physical
qualifications and develop a job-related preemployment test. They administered three different
batteries of tests to 140 male and 16 female patrol officers. The laboratory and field test batteries
included strength, flexibility, body composition, and cardiorespiratory endurance items. The job
sample tests included a barrier surmount and arrest simulation, and a dummy drag that simulated
dragging an injured victim 50 feet to safety.

The major differences between the field and laboratory batteries were that the 1.5-mile run
replaced the maximum treadmill test, and body fat was estimated from skinfolds rather than
measured by hydrostatic weighing. The laboratory test battery was significantly correlated with the
dummy drag (R = 0.66) and barrier surmount and arrest simulation tests (R = 0.68). Replacing the
laboratory tests with the field tests resulted in slightly lower correlations, 0.57 for the dummy drag
and 0.62 for the barrier surmount and arrest simulation tests. Although the fitness tests estimated
work simulation test performance, test performance was not related to job performance, which
consisted of supervisor ratings on 16 critical job tasks.

The data analysis showed that the officers were similar to the normal population in strength, body
fat, flexibility, and cardiorespiratory endurance. An important result of the study was that the
predominantly sedentary nature of the officer’s job led to a rapid deterioration in physical fitness
following his or her academic training, suggesting the need for an in-service physical conditioning program.


Steel  Workers 

Arnold and associates (Arnold et al., 1982) developed a preemployment test for selecting
entry-level steel workers. The task analysis documented that entry-level steel workers must do several
different physically demanding tasks. The investigators used a combination of content- and
construct-validation strategies. The job analysis identified the physically demanding work tasks required of
the entry-level workers and categorized them by Fleishman’s constructs of static strength,
dynamic strength, and endurance (Fleishman, 1964). The selected candidate physical performance tests
were those that theoretically measured these constructs.

The objective of the study was to determine whether the physical performance tests were related to
the work-sample tests developed from the job analysis. The sample included 168 men and 81 women
who were in their first 6 months of employment at three different plant locations. The job analysis
showed that work tasks differed somewhat across the 3 sites, resulting in 11 work sample tests at 1 site
and 12 at the other 2 sites. The average work-sample test performance was the criterion of work
performance. In addition to the work-sample tests, each subject completed 10 physical performance tests
sampling strength, flexibility, agility, balance, and cardiorespiratory endurance dimensions.

Multiple regression selected the physical performance tests most highly correlated with the
work-sample criterion. For all three work sites, arm dynamometer strength was the most important
predictor of work-sample test performance. The zero-order correlations between arm strength and
work-sample test performance were consistently high: 0.82, 0.85, and 0.85 for the three sites.
Adding two more tests to the multiple regression models added little to the validity; the multiple
correlations for the three-predictor models increased to 0.87, 0.88, and 0.89.
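
Squaring these correlations makes the size of the gain concrete. The short Python sketch below
uses only the six coefficients reported above and converts each to the proportion of work-sample
variance accounted for; the variable names are illustrative.

```python
def variance_explained(r):
    """Proportion of criterion variance accounted for by a correlation r."""
    return r ** 2


single_test_r = [0.82, 0.85, 0.85]      # arm strength alone, by site
three_test_r = [0.87, 0.88, 0.89]       # three-predictor models, by site

gains = []
for site, (r1, r3) in enumerate(zip(single_test_r, three_test_r), start=1):
    gain = variance_explained(r3) - variance_explained(r1)
    gains.append(gain)
    print(f"site {site}: {variance_explained(r1):.1%} -> "
          f"{variance_explained(r3):.1%} (gain {gain:.1%})")
```

At every site the single arm strength test already accounts for roughly two-thirds or more of the
variance, and the two added tests contribute less than 10 percentage points.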

The authors completed a utility analysis for the single arm strength test (Hunter, Schmidt, &
Hunter, 1979). This analysis involved estimating the money the company would save by hiring
workers who could do the work. Utility estimates were based on test validity and the monetary
value was related to the variability of work performance. Using 1982 wage standards, Arnold and
associates estimated that using the single arm strength test to select employees would lead to a
savings of about $5,000 per year for each employee selected. Based on employees hired, the estimated
company savings were more than $9 million a year.
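
A utility estimate of this kind can be sketched with the widely used Brogden-Cronbach-Gleser
formulation, in which the yearly gain per hire is the product of test validity, the dollar value of
one standard deviation of job performance, and the mean standardized test score of those hired.
Whether Arnold and associates used exactly these inputs is not stated here, so the code is an
illustration only: the dollar standard deviation and the selection ratio are hypothetical, testing
costs are ignored, and test scores are assumed to be normally distributed.

```python
from statistics import NormalDist


def mean_z_of_selected(selection_ratio):
    """Mean standardized test score of those hired, assuming normally
    distributed scores and top-down hiring of the given proportion."""
    z_cut = NormalDist().inv_cdf(1.0 - selection_ratio)
    return NormalDist().pdf(z_cut) / selection_ratio


def utility_gain_per_hire(validity, sd_performance_dollars, selection_ratio):
    """Expected yearly dollar gain per hire compared with random selection."""
    return validity * sd_performance_dollars * mean_z_of_selected(selection_ratio)


VALIDITY = 0.85          # arm strength vs. work-sample correlation reported above
SD_DOLLARS = 8000.0      # hypothetical yearly dollar value of one SD of performance
SELECTION_RATIO = 0.5    # hypothetical: half of the applicants are hired

gain = utility_gain_per_hire(VALIDITY, SD_DOLLARS, SELECTION_RATIO)
print(f"Estimated gain per hire per year: ${gain:,.0f}")
```

The formulation shows why utility depends on both validity and the variability of work
performance: if either is near zero, the estimated savings vanish regardless of the other.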

Underground Coal Mining

A job analysis showed that the work of underground coal miners was physically demanding and
that the work could be represented with four work sample tests (Jackson & Osburn, 1983; Jackson
et al., 1991). The first work-sample simulation test, roof bolting, measured maximum isokinetic
torque and simulated straightening a steel roof bolt. The block carry test involved lifting,
transporting, and placing 82-pound concrete blocks in positions commonly used to build retaining walls
in the mine. The shoveling simulation test involved shoveling polyvinyl chloride from the floor over
a 3.5-foot wall. Polyvinyl chloride has the same density as coal, and the task was to shovel 800
pounds at a rate consistent with the subject’s fitness. The bag carry simulation test measured the
number of 50-pound bags that were lifted and transported 9 feet during a 5-minute period.

The four work-sample tests and three isometric strength tests (grip, arm lift, and torso lift)
(NIOSH, 1977) were administered to 25 male and 25 female subjects. The validation strategy was
similar to that followed by Arnold and associates with steelworkers (Arnold et al., 1982). The
correlations between the sum of the isometric strength tests and four work-sample tests ranged from
0.68 for the bag carry test to 0.91 for the roof bolting test. Multiple regression analysis showed that
neither gender nor the gender-by-isometric strength interaction accounted for additional
significant variance. This showed that a common male and female regression line defined the
relationship between strength and work-sample test performance.
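
The test for a common regression line can be sketched as a comparison of two nested models:
one that predicts work-sample performance from strength alone, and one that adds gender and the
gender-by-strength product. The Python below is a minimal illustration, not the authors'
analysis. The data are made up, and the least squares routine is a small hand-written solver
included only to keep the example self-contained.

```python
def ols_sse(rows, y):
    """Residual sum of squares from an ordinary least squares fit.

    rows: one list of predictor values per subject (caller includes the constant 1.0).
    Solves the normal equations by Gaussian elimination with partial pivoting.
    """
    k = len(rows[0])
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (aug[i][k] - tail) / aug[i][i]
    return sum(
        (yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(rows, y)
    )


def gender_f_statistic(strength, female, work):
    """F statistic for adding gender and gender-by-strength to a strength-only model."""
    reduced = [[1.0, s] for s in strength]
    full = [[1.0, s, g, g * s] for s, g in zip(strength, female)]
    sse_reduced = ols_sse(reduced, work)
    sse_full = ols_sse(full, work)
    df_error = len(work) - 4          # four coefficients in the full model
    return ((sse_reduced - sse_full) / 2) / (sse_full / df_error)


# Hypothetical data: summed isometric strength and a work-sample score.
strength = [40, 50, 60, 70, 60, 70, 80, 90]
female = [1, 1, 1, 1, 0, 0, 0, 0]
noise = [1, -1, -1, 1, 1, -1, -1, 1]

# Case 1: one line (10 + 0.5 * strength) describes both groups.
common = [10 + 0.5 * s + e for s, e in zip(strength, noise)]
# Case 2: women score 10 units lower than men of equal strength.
shifted = [w - 10 * g for w, g in zip(common, female)]

f_common = gender_f_statistic(strength, female, common)
f_shifted = gender_f_statistic(strength, female, shifted)
```

A small F statistic, as in the first case, is the pattern reported for the coal mining data:
gender adds nothing once strength is known. A large F, as in the second case, would indicate
that separate regression lines are needed.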

Both exercise heart rate and rating of perceived exertion data showed that the shoveling and bag
carry tests had significant aerobic components (Jackson et al., 1991). In addition to the isometric
strength tests, the subject’s maximal arm cranking oxygen uptake was metabolically determined. The
zero-order correlations between the sum of isometric strength and the work-sample shoveling and
bag carry tests were higher than the correlations found with arm VO2max (ml/min). The strength
correlations were 0.71 for shoveling and 0.63 for the bag carry test, compared with 0.68 and 0.46 for
arm VO2max (ml/min). Multiple regression analysis showed that arm VO2max accounted for an
additional 9 percent of shoveling variance beyond that of isometric strength but did not account for
additional bag carry variance. Polynomial regression analysis showed that the relationship between
these two endurance work-sample tests and isometric strength was quadratic, not linear. Strength
was more important for differentiating among work sample performance at the lowest levels.

Chemical Plant Workers

Job analyses documented that the physically demanding tasks required of chemical and refining
plant workers included cracking, opening, and closing valves (Jackson, Osburn, Laughery, &
Vaubel, 1990; Osburn, 1977). Osburn (1977) developed a valve-turning work-simulation test
administered on a specially developed ergometer consisting of a disc brake mechanism turned by a
12-inch valve handwheel. The unit was calibrated to a power output of 1,413.5 foot-pounds/minute.
The objective of the work-sample test was to complete 250 revolutions in 15 minutes. The job
analysis showed this level of work would open or close 75 percent of the emergency valves in 15 minutes.

The distribution of the valve-turning test was bimodal. Physically fit workers easily completed
the 15-minute test, but the test was too demanding for many, who stopped before reaching 50
revolutions (Jackson et al., 1990). The test elicited maximal cardiovascular responses in many
applicants (Osburn, 1977). This result led to a second study designed to determine whether isometric
strength tests validly predicted valve-turning performance (Jackson, 1987; Jackson et al., 1992).
The valve-turning work-sample test and three isometric strength tests (grip, arm lift, and torso lift)
were administered to 26 men and 25 women. The zero-order correlation between the tests was
0.82. Because of the bimodal shape of the valve-turning distribution, a logistic regression model
(Pedhazur, 1997) defined the probability of completing the test by levels of isometric strength.
The logistic equations and probability curves are published (Jackson et al., 1992).
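
A logistic model of this kind converts a strength score into a probability of completing the
work-sample test, and it can be inverted to find the strength level associated with a chosen
probability. The sketch below shows both directions. The intercept and slope are hypothetical
values chosen for illustration; they are not the coefficients published by Jackson and associates.

```python
import math


def pass_probability(strength, intercept, slope):
    """Logistic probability of completing the work-sample test."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * strength)))


def strength_for_probability(p, intercept, slope):
    """Strength score at which the logistic model predicts probability p."""
    return (math.log(p / (1.0 - p)) - intercept) / slope


# Hypothetical coefficients, NOT those published in Jackson et al. (1992).
INTERCEPT = -8.0
SLOPE = 0.04  # change in log-odds per unit of summed isometric strength

for strength in (150, 200, 250):
    p = pass_probability(strength, INTERCEPT, SLOPE)
    print(f"strength {strength}: probability of completing the test {p:.2f}")

cut_for_75 = strength_for_probability(0.75, INTERCEPT, SLOPE)
print(f"strength at which the predicted probability is 0.75: {cut_for_75:.1f}")
```

Reading a cut-score from the probability curve in this way lets the employer state the standard
as a chance of success rather than as an arbitrary strength value.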

In  a  second  study,  a  task  analysis  questionnaire  completed  by  operators  at  a  major  chemical 
plant  identified  valve  cracking  as  the  most  physically  demanding  work  task  (Jackson  et  al.,  1990). 
An  electronic  load  cell  measured  the  peak  cracking  torque  on  217  randomly  selected  valves  in  the 
plant.  The  sampled  valves  included  those  with  horizontal  and  vertical  orientations,  positioned  close 
to  the  ground  and  overhead,  those  in  awkward  or  hard  to  reach  positions,  and  valves  of  various 
sizes.  The  results  of  this  biomechanical  job  analysis  showed  that  100  pounds  of  force  applied  to  the 
end  of  a  36-inch  valve  wrench  generated  sufficient  torque  to  crack  93  percent  of  the  plant  valves. 
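
The torque implied by this criterion follows from the product of force and lever arm. The short
sketch below works the arithmetic and shows how the required force would change with wrench
length. It assumes the force is applied perpendicular to the wrench, and the 24-inch wrench is a
hypothetical comparison, not a value from the study.

```python
def torque_ft_lb(force_lb, lever_in):
    """Torque in foot-pounds from a force applied perpendicular to a lever."""
    return force_lb * lever_in / 12.0


def force_needed_lb(torque, lever_in):
    """Force in pounds needed at the end of a lever to produce a given torque."""
    return torque * 12.0 / lever_in


criterion_torque = torque_ft_lb(100, 36)            # the 100-pound, 36-inch criterion
shorter_wrench_force = force_needed_lb(criterion_torque, 24)

print(f"criterion torque: {criterion_torque:.0f} foot-pounds")
print(f"force needed with a 24-inch wrench: {shorter_wrench_force:.0f} pounds")
```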

A valve-cracking work-sample test simulated cracking valves in eight different ways. The eight
cracking torques were obtained by varying the action (push and pull), direction (horizontal and
vertical), and height (high and low). A computerized torque wrench measured the torque applied to
four nuts placed in vertical and horizontal positions at two heights.

The valve-cracking test and isometric strength tests (grip, arm lift, and torso lift) were
administered to 118 men and 66 women. The intercorrelations among the eight measures of
valve-cracking torque were high, ranging from 0.66 to 0.89. Because of the high intercorrelations, the eight
valve-cracking scores were averaged and used as the work-sample measure. The correlation between
the sum of the three isometric strength tests and average valve-cracking torque was 0.65. A
logistic regression equation (Pedhazur, 1997) defined a probability model for estimating the chances
of generating the 100-pound criterion for levels of isometric strength. These data are published
elsewhere (Jackson et al., 1992).

Electrical Transmission Lineworkers

Doolittle and associates (Doolittle et al., 1988) developed a preemployment test for selecting
electrical transmission lineworkers. The study included an extensive job analysis of electrical
transmission lineworker jobs. The initial stage of the task analysis surveyed workers using scales
designed to answer three questions:

1.  How  often  was  each  task  performed? 

2.  How  much  time  was  spent  completing  each  task? 

3.  How  physically  demanding  was  each  task  for  the  individual? 

The identified critical, physically demanding tasks were studied in detail to define the forces
needed to perform them safely and efficiently. This involved defining standard anatomical
movements for lifting, pushing, and hoisting; measuring the masses lifted and forces exerted; and
estimating the metabolic costs of various work tasks.

Using the task analysis data, 5 strength tests that duplicated the muscular actions were selected
and administered to 48 incumbents. The tests required the subject to move a weight that
represented loads that linemen moved. The weights ranged from 7 to 61 kilograms. The final two tests
selected were chin-ups and VO2max estimated from bench stepping and exercise heart rate. The seven tests
were combined into a single performance measure. Criterion-related validity was examined by
comparing physical test performance with two criteria, supervisor ratings and accident rates. The crew
chiefs confidentially evaluated each incumbent on the following six dimensions of job performance:

1.  productivity, 

2.  working  with  others, 

3.  supervision, 

4.  safety, 

5.  physical  ability,  and 

6.  technical  skills. 

The correlations between the composite physical test score and the criteria of supervisor ratings
and lost work days because of on-the-job injuries averaged over 5 years were 0.59 and 0.46, respectively.


Diver  Training 

Two validation studies (Gunderson, Rahe, & Arthur, 1972; Hogan, 1985) were designed to
estimate successful completion of military underwater diver training programs. Gunderson and
associates (Gunderson et al., 1972) used successful completion of underwater demolition training as the
criterion of performance. They found a multiple correlation of 0.54 between success defined by the
completion of training and five variables: squat-jumps, pull-ups, sit-ups, body weight, and the Cornell
Medical Index. Using these tests, they predicted about 70 percent of those who passed training.

Hogan (Hogan, 1985) used 46 male naval personnel who volunteered for diver training. The
first criterion of success included nine performance rating scales that reflected physical condition,
swimming training, leadership potential, teamwork, and overall performance. The second criterion
was successful completion of training. The predictor measures included 3 anthropometric
measurements and 23 fitness tests. Hogan reported a multiple correlation of 0.63 between the average
performance rating and three physical tests: 1-mile run, sit and reach, and muscular endurance
measured with an arm ergometer. The multiple correlation between these three tests and successful
completion of the course was 0.64. Hogan suggested that the validity coefficients were likely an
overestimate because of an unfavorable ratio of the number of variables to subjects (Pedhazur, 1997).
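
The size of the likely overestimate can be gauged with the standard adjusted R-squared
correction, which shrinks the squared multiple correlation according to the number of predictors
and subjects. The sketch below applies it to the reported values (R = 0.63, 46 subjects). This
correction is a textbook formula used here for illustration, not necessarily the one Hogan
applied, and treating all 26 candidate measures as predictors is an assumption that gives a
deliberately pessimistic bound.

```python
def adjusted_r_squared(r, n_subjects, n_predictors):
    """Adjusted R-squared for a multiple correlation r.

    The value can be negative when the number of predictors is large
    relative to the number of subjects.
    """
    shrink = (n_subjects - 1) / (n_subjects - n_predictors - 1)
    return 1.0 - (1.0 - r ** 2) * shrink


R = 0.63
N = 46

final_model = adjusted_r_squared(R, N, 3)       # the three tests retained
all_candidates = adjusted_r_squared(R, N, 26)   # 3 anthropometric + 23 fitness measures

print(f"unadjusted R-squared: {R ** 2:.3f}")
print(f"adjusted for 3 predictors: {final_model:.3f}")
print(f"adjusted for 26 candidate predictors: {all_candidates:.3f}")
```

The contrast between the two adjusted values shows why a small sample combined with a large pool
of candidate predictors calls for cross-validation before the coefficients are taken at face value.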

Demanding Military Jobs

The U.S. Military Services examined methods of matching enlisted personnel with physically
demanding jobs. The U.S. Air Force adopted a pre-induction dynamic one-repetition maximum
(1-RM) strength test (Ayoub et al., 1982). The U.S. Army and U.S. Navy examined the
relationship between body composition variables and physically demanding work tasks (Marriott &
Grumstrup-Scott, 1992).

The U.S. Air Force developed a Strength Aptitude Test (SAT) to match the general strength
abilities of individuals with the specific strength requirements of U.S. Air Force jobs filled by
enlisted personnel (Ayoub et al., 1982). The U.S. Air Force SAT measures the subject’s voluntary 1-RM
lift to a height of 6 feet. The SAT starts with a 40-pound lift. The lift load is increased in
10-pound increments until the subject reaches his or her maximum voluntary lift or a maximum weight
of 200 pounds. The SAT is administered to U.S. Air Force recruits as part of their pre-induction
physical examination. Each enlisted U.S. Air Force career field has a prerequisite SAT cut-score.
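
The incremental SAT protocol described above can be sketched in a few lines of Python. This is only an illustration of the load progression (40-pound start, 10-pound increments, 200-pound ceiling); the function name and the `can_lift` callback, which stands in for the subject's actual attempt at each load, are assumptions made for the example and are not part of the Air Force procedure.

```python
def sat_max_lift(can_lift, start_lb=40, step_lb=10, ceiling_lb=200):
    """Return the heaviest load (in pounds) lifted under the SAT progression.

    can_lift is a hypothetical callback: it receives a load in pounds and
    returns True if the subject completes the lift at that load.
    Returns 0 if the starting load cannot be lifted.
    """
    best = 0
    load = start_lb
    # Keep adding weight until the subject fails or the ceiling is reached.
    while load <= ceiling_lb and can_lift(load):
        best = load
        load += step_lb
    return best


# Example: a subject whose true capacity is 95 pounds is credited with 90,
# because loads are only tested in 10-pound increments.
score = sat_max_lift(lambda load: load <= 95)
```

A career-field cut-score would then be applied by comparing `score` with the prerequisite value for that field.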

An area of concern expressed by the Committee on Military Nutrition Research of the Institute
of Medicine, National Academy of Sciences, is the role body composition plays in physical
performance. This relationship is important not only for making decisions about acceptance or
rejection of recruits for the Military Service but also for retention and advancement while in the
Service (Marriott & Grumstrup-Scott, 1992). Hodgdon and associates (Hodgdon, 1992) examined the
relationship between body composition, fitness, and materials-handling tasks required of naval
enlisted men. The two materials-handling tasks were the maximum box weight that could be lifted
to elbow height and the total distance a 34-kilogram box could be carried during two 5-minute
workouts. The variables most highly correlated with maximum box lift were push-ups (r = 0.63)
and fat-free mass (r = 0.80). The variables most highly correlated with the box carry test were
push-ups (r = 0.56), 1.5-mile run time (r = -0.67), and fat-free mass (r = 0.44). Fat-free mass was
highly correlated with muscular strength measures, suggesting the possibility of using fat-free mass
as an approximation of general strength in job assignment.

Vogel and Friedl (Vogel & Friedl, 1992) examined the relationship between body composition
and absolute lifting capacity. They reported significant correlations between maximum lifting
capacity and fat-free mass for male and female soldiers. Although they did not test for homogeneity
of male and female regression lines, they published separate equations for men and women.

A limitation of Military testing programs is the lack of job-related materials-handling
performance tests. While recognizing the need to develop content-valid tests, the Committee on
Military Nutrition Research concluded that there was a direct relationship between Military
materials-handling tasks and fat-free mass. In view of this relationship and the lack of job-related
tests, the Military should seriously consider establishing a minimum standard for fat-free mass
(Marriott & Grumstrup-Scott, 1992). Such a recommendation might be implemented for the Military, but
using body composition variables in pre-employment tests in the private sector would likely meet
an immediate legal challenge.

Rayson and associates (Rayson et al., 2000a; Rayson et al., 2000b) completed a major
criterion-related validation study for the British army. They examined the effectiveness of the
British army’s Physical Selection Standards for Recruits (PSS(R)) in predicting criteria measuring
recruit success in basic training. The PSS(R) consisted of tests measuring body mass, body
composition, strength, and endurance. The criteria included:

1.  four  representative  Military  tasks  (RMT)  consisting  of  a  single  lift,  carry,  repetitive  lift, 
and  loaded  march, 

2.  the  days  lost  to  injury  and  sickness  during  basic  training, 

3.  degree  of  success  of  basic  training,  and 

4.  job  performance  ratings  by  self,  peer,  and  supervisor. 

The  PSS(R)  tests  were  administered  to  more  than  1,000  recruits  (770  males  and  239  females)  prior  to 
starting  basic  training,  and  the  army  job  performance  criteria  were  obtained  at  the  end  of  basic  training. 

The PSS(R) tests correctly predicted outcomes on the RMTs for 74.9 percent of the recruits,
of which 58.7 percent were true positives and 16.2 percent were true negatives. Of the 25.1 percent
misclassified, 15.5 percent were false positives and 9.6 percent were false negatives. The false
negatives were those recruits predicted by the PSS(R) tests to fail the four RMTs who in fact passed
the tasks. Although data were not presented, the authors indicated that most of the female
misclassifications were false positives, that is, women were incorrectly accepted rather than
incorrectly rejected from the army. A significant relationship was found between training outcome and
passing the PSS(R) tests. Additionally, the PSS(R) tests were significantly related to days lost because
of injury and sickness during basic training. Those recruits who failed the selection tests lost
a median of 2 days compared with no days for the recruits who passed. Although the difference was not
statistically significant, the performance ratings of those who failed the selection tests were
consistently lower than the ratings of those who passed. The authors concluded that the PSS(R) tests
were valid, useful predictors of British army performance.
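
The classification percentages reported for the PSS(R) can be summarized with standard decision-accuracy indices. The sketch below takes the four percentages given in the text and derives overall accuracy, sensitivity, and specificity; the function name and the choice of indices are illustrative assumptions, not statistics reported by Rayson and associates.

```python
def classification_summary(tp, tn, fp, fn):
    """Summarize a 2 x 2 selection outcome table.

    tp: predicted to pass and passed      fp: predicted to pass but failed
    tn: predicted to fail and failed      fn: predicted to fail but passed
    Inputs may be counts or percentages of the total sample.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,          # proportion correctly classified
        "misclassified": (fp + fn) / total,     # proportion incorrectly classified
        "sensitivity": tp / (tp + fn),          # passers the tests identified
        "specificity": tn / (tn + fp),          # failers the tests identified
    }


# Percentages from the text: 58.7 true positive, 16.2 true negative,
# 15.5 false positive, 9.6 false negative.
rates = classification_summary(tp=58.7, tn=16.2, fp=15.5, fn=9.6)
```

With these inputs the accuracy entry reproduces the 74.9 percent figure in the text; the sensitivity and specificity values are derived here only for illustration.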


Manual  Lifting  Tasks 

Manual lifting tasks are common elements of many jobs and have been studied extensively.
The reason for this attention is the large number of jobs that include materials-handling tasks
and the injury risk associated with lifting. It is estimated that about 50 percent of
all industrial back injuries are caused by lifting, and about 67 percent of the injuries are caused by
lifting loads that are too difficult for industrial workers (Snook et al., 1978).

An established ergonomic injury-reduction strategy is to match the worker with the demands
of the lifting task. One major approach is to engineer the stress out of the task. This approach
defines the lift weights that are within the physiological capacity of most industrial workers
(Ayoub, 1982). The first research-based strategy used psychophysical methods to define the lift
weight perceived as acceptable to 75 percent of industrial workers. Snook and associates (Snook &
Ciriello, 1974; Snook & Ciriello, 1991; Snook, Irvine, & Bass, 1970) published separate standards
for males and females. The maximum acceptable lift weight for females was about 50 percent of the
lift weights for males. A newer strategy is the use of the NIOSH multiplicative equations (NIOSH,
1981; Waters et al., 1993) that consider several different lift difficulty parameters. The NIOSH
equations extend the Snook and associates’ psychophysical methodology by also using biomechanical
and physiological criteria to define the recommended weight of lift (RWL). The newest NIOSH
equation (Waters et al., 1993) defines an RWL that would be acceptable to 75 percent of the female
industrial population. Using the 75th percentile female as the RWL criterion produces a conservative
estimate. The RWL for the common floor-to-knuckle lift at a frequency of one lift every 30
minutes, for example, is only 10 kilograms or 22 pounds (Waters et al., 1993).
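
The multiplicative structure of the revised NIOSH equation can be illustrated with a short sketch. The load constant and the multiplier formulas below are the metric forms generally published for the Waters et al. (1993) equation and are not reproduced from this chapter. The frequency and coupling multipliers come from lookup tables in the original publication, so they are passed in as plain numbers here. The function name, the argument names, and the assumption that all inputs fall inside the equation's valid ranges are simplifications made for the example.

```python
def niosh_rwl_kg(h_cm, v_cm, d_cm, a_deg, fm=1.0, cm=1.0):
    """Recommended weight of lift (kg) as a product of task multipliers.

    h_cm:  horizontal distance of the hands from the ankles
    v_cm:  vertical height of the hands at the start of the lift
    d_cm:  vertical travel distance of the lift
    a_deg: asymmetry (trunk twisting) angle in degrees
    fm, cm: frequency and coupling multipliers, taken from the published
            tables by the caller (assumed inputs, not computed here)
    Inputs are assumed to lie within the equation's valid ranges.
    """
    load_constant = 23.0                              # kg, ideal-condition load
    hm = min(1.0, 25.0 / max(h_cm, 25.0))             # horizontal multiplier
    vm = 1.0 - 0.003 * abs(v_cm - 75.0)               # vertical multiplier
    dm = min(1.0, 0.82 + 4.5 / max(d_cm, 25.0))       # distance multiplier
    am = 1.0 - 0.0032 * a_deg                         # asymmetry multiplier
    return load_constant * hm * vm * dm * am * fm * cm


# Under ideal geometry every multiplier equals 1 and the RWL equals the
# load constant; any departure from ideal conditions lowers the RWL.
ideal = niosh_rwl_kg(h_cm=25, v_cm=75, d_cm=25, a_deg=0)
far_reach = niosh_rwl_kg(h_cm=50, v_cm=75, d_cm=25, a_deg=0)
```

Because every multiplier is at most 1, each difficulty parameter can only reduce the recommended weight, which is why the equation yields conservative values for common tasks.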

The NIOSH equation focuses on job design, i.e., defining an RWL for most male (99 percent)
and female (75 percent) industrial workers for all ages in the workforce. A limitation of the NIOSH
equation is that it does not consider individual differences in the physiological capacity of workers.
Many common materials-handling tasks exceed the NIOSH equation’s RWL estimates. The second
ergonomic method of matching the worker with the demands of the job is to select individuals
with the physiological capacity to do the job with a margin of safety (Ayoub, 1982; Keyserling et
al., 1980; NIOSH, 1977).

The content-validation method is often used to validate materials-handling tests. A
content-valid test would be to have the applicant perform the task, e.g., lift a 90-pound jackhammer and
transport it a specified distance. Although this type of test would be content valid, it has two
limitations. First, it is not possible to determine one’s maximum capacity. Second, motivated applicants
without the physiological capacity demanded by the task place themselves at risk of injury (Ayoub,
1982). One of the first ergonomic approaches used to overcome these limitations was to use
isometric strength tests that duplicated the position assumed by the worker to do the lift. These
position-specific strength data were used to determine whether an applicant had sufficient strength
capacity to do the work with a margin of safety (Keyserling et al., 1980; Keyserling et al., 1980).

Gilliam and Lund (2000) examined the effects on work-related injuries of physiologically
matching workers to the demands of the job. Isokinetic strength was measured on 365 applicants
for truck driver and dockworker jobs. The isokinetic data were used to generate a Department of
Labor Dictionary of Occupational Titles strength rating. This rating was used to select applicants
who matched the physical demands of the job. Of the 365 applicants, 276 matched the job demands
and were hired. The 89 applicants who did not match were not hired. Those hired were significantly
stronger than those who were not hired. In addition, those not hired were significantly heavier
than those hired, by an average of 44 pounds. The injury rates of the
strength-matched new hires were compared with historical data on workers matched for employment
duration. The overexertion injury rates to the knees, shoulders, and back were 1.04 for the
strength-matched workers compared with 16.7 for the non-matched workers, suggesting that
pre-employment screening is effective in reducing injury. Although body composition was not examined,
these results suggest that it may also have been a factor. A strength-weight profile of weaker and
heavier versus stronger and lighter suggests a difference in percent body fat. The stronger-lighter
profile is consistent with a lower percent body fat, which also might have been an injury risk factor.

Another physiological approach to matching the worker to the demands of the job is to use
standard strength tests to assess an individual’s physiological capacity and use regression models to
define the probability of being able to complete a lift (Jackson & Sekula, 1999; Jackson, Borg, Zhang,
Laughery, & Chen, 1997). This approach was used to study hospital workers involved with lifting
and transporting patients. An analysis of hospital jobs documented that patient lifting was a
demanding lift task (Jackson, Osburn, Laughery, Young, & Zhang, 1994). Patient lift tasks are a
major source of injury to the lifter (Garg & Owen, 1992). The lift dimensions of the most common
single-person patient lift were used to devise a work-sample lift test. The most common patient lift
task is lifting a patient who is sitting in a chair. The simulated lift test consisted of lifting a box from
a height of 53 cm to a height of 48 cm. The hand position at the start of the lift was at the height at
which the lifter would grasp a patient sitting in a chair. The lift task consisted of lifting seven loads
ranging in weight from 15 to 90 pounds. The subjects lifted those loads that were within their capacity
and rated lift difficulty with Borg’s CR-10 psychophysical scale (Borg, 1982; Borg, 1998). Logistic
regression analysis of the data on 58 female and 33 male subjects showed that the capacity to
complete a lift depended on the lifter’s physiological capacity sampled by his or her isometric strength
and fat-free mass. Further analyses showed that the subject’s CR-10 rating of each lift was
significantly correlated with isometric arm, shoulder, torso, and leg strength, and fat-free weight.
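
The logistic regression approach described above converts a lifter's physiological capacity into a probability of completing a lift. The sketch below shows only the general form of such a model. The predictor set (isometric strength, fat-free mass, and lift load), the function name, and every coefficient value are hypothetical placeholders chosen for illustration; they are not the coefficients estimated in the patient lift study.

```python
import math


def lift_probability(strength, fat_free_mass, load, coef):
    """Probability of completing a lift from a logistic model.

    strength, fat_free_mass, load: predictor values in whatever units the
    model was fitted with (assumed consistent with the coefficients).
    coef: dict of model coefficients; the keys used here are hypothetical.
    """
    logit = (coef["intercept"]
             + coef["strength"] * strength
             + coef["fat_free_mass"] * fat_free_mass
             + coef["load"] * load)
    # The logistic function maps the linear predictor onto the 0-1 range.
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients, for illustration only.
example_coef = {"intercept": -2.0, "strength": 0.02,
                "fat_free_mass": 0.03, "load": -0.05}

p_light = lift_probability(strength=150, fat_free_mass=50, load=30,
                           coef=example_coef)
p_heavy = lift_probability(strength=150, fat_free_mass=50, load=90,
                           coef=example_coef)
```

In a model of this form, a negative load coefficient and positive capacity coefficients make the predicted probability fall as the load rises and rise with the lifter's strength and fat-free mass.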

The results of the patient lift study suggested that lift weight and the physiological capacity of
the lifter could be used to develop a generalized lift model. The second study examined the role of
lift load, strength, and gender on psychophysical lift capacity (Jackson, 1999). A floor-to-knuckle
lift test was administered to 209 men and 181 women. The task involved lifting loads ranging from
22 to 143 pounds. The subject started with a light lift load and continued to lift heavier loads until
either the heaviest load was lifted or the subject failed the lift. The load increased in increments of
11 pounds. After each completed lift, the subject rated the lift difficulty with Borg’s CR-10 scale
(Borg, 1998). The subject’s physiological strength capacity was measured with basic isometric
strength tests (Baumgartner & Jackson, 1999). Each subject’s dynamic lift profile was defined with
a power function regression equation using the completed lift weight as the independent variable
and the CR-10 rating as the dependent variable. Using the power function regression equation, one
lift weight and the associated CR-10 rating were randomly selected for each subject. This created
a distribution of lift weights and associated psychophysical ratings ranging from very easy to the
maximum within the subject’s psychophysical capacity. Multiple regression provided an equation
to estimate psychophysical lift difficulty from lift load, strength, and the gender-by-weight
load interaction. The multiple correlation for the model was 0.81, with a standard error
of 1.7 CR-10 units. The derived equation provided a model that defined the psychophysical lift
demands of common industrial weight loads for individuals who differed in physiological capacity.
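
A subject's power function lift profile, with lift weight as the independent variable and CR-10 rating as the dependent variable, can be estimated as sketched below. The source does not state how the power functions were fitted. The example assumes ordinary least squares on the log-transformed values, which is one common approach, and the function names and sample data are hypothetical.

```python
import math


def fit_power_function(weights, ratings):
    """Fit rating = a * weight**b by least squares on log-log values.

    All weights and ratings must be greater than zero.
    Returns the tuple (a, b).
    """
    xs = [math.log(w) for w in weights]
    ys = [math.log(r) for r in ratings]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                          # exponent = slope in log-log space
    a = math.exp(mean_y - b * mean_x)      # scale = exp(intercept)
    return a, b


def predict_rating(a, b, weight):
    """Estimated CR-10 rating for a given lift weight."""
    return a * weight ** b


# Hypothetical subject: loads in 11-pound steps and ratings that follow an
# exact power function, used here only to show that the fit recovers it.
loads = [22, 33, 44, 55, 66, 77]
cr10 = [0.01 * w ** 1.5 for w in loads]
a_hat, b_hat = fit_power_function(loads, cr10)
```

Once `a_hat` and `b_hat` are known for a subject, `predict_rating` gives the expected rating at any load within the range that subject completed.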

The psychophysical modeling of industrial lift tasks provides evidence not only about an
individual’s probability of being able to complete a lift but also about the psychophysical stress of
the lift. The psychophysical demand of a lift task is related to the risk of back injury (Herrin, 1986;
Liles, 1984; Snook et al., 1978). Lifting loads psychophysically judged to be difficult increases the
risk of injury. Psychophysical ratings provide an index of relative demand for the individual. Resnik
(1995) presents preliminary data showing that Borg’s psychophysical rating can be interpreted by the
physiologically significant scale of percentage of maximum capacity. With a sample of 254 male and 354
female subjects, a correlation of 0.91 was obtained between Borg’s CR-10 rating and the subject’s
maximum functional lift capacity (Sekula, Jackson, & Laughlin, under review). Maximum functional
lift capacity was the subject’s percentage of maximum lift, where maximum lift represented the
weight load corresponding to a CR-10 rating of 10. A regression equation
was developed to convert CR-10 ratings into the metric of percentage of maximum. The standard error
of estimate for the linear equation was 8.5 percent of maximum. This research could provide researchers
with the capacity to interpret psychophysically defined lift loads with the well-established
physiological intensity metric of percentage of maximum capacity.


Summary 


In summary, the Uniform Guidelines require validity studies to be carried out whenever there is
a need to continue selection practices that lead to adverse impact. Three types of validity studies are
recognized: content-validity, criterion-related validity, and construct-validity studies. The guidelines
require all validity studies to be carried out in a responsible, scientifically sound manner, and call for
the use of good judgment in the implementation of selection procedures. The EEOC is waiting for
developments in the field before it completely endorses construct-validity studies. A major
difference in physical test validation is the use of physiological rather than psychological tests. The
goal of physiological validation is to define the physiological capacity needed by a worker to perform
the work demanded by the task. Principal features of the physiological validation approach are the use
of a physiological metric to quantify test performance and the interpretation of validity results using
relevant physiological research and theory. These data are used to develop physiologically sound
cut-scores. Although numerous physical test validation studies have been completed, most are not
published. The results of those published show that physical tests can be used to select workers with the
physiological capacity to do demanding jobs. Ergonomic research shows that selecting workers with
the physiological capacity to do the work reduces the risk of work-related injuries.


Endnotes 


1. The probability can be estimated with the following equation:

$$z = \frac{Y' - \text{criterion}}{\text{standard error of estimate}}$$

where Y' is the estimated criterion score and the criterion is the desired value, in this example, 100.
Once the z-score is obtained, a table of the normal curve can be used to estimate the proportion of
subjects that can be expected to exceed the criterion for a given strength level.

2. Personal communication between A. Jackson, University of Houston, and Dr. John Hater of the FedEx
Corporation. Engineers used the power output data in Figure 5.3 to estimate expected changes in
productivity produced by changes in the physiological capacity of the workforce.

3. This review was initially published in 1994 by one of the authors of this chapter (Jackson, 1994) and
expanded to include studies published since that time.
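
The z-score calculation in endnote 1 can be carried out without a printed normal-curve table. The sketch below is a minimal illustration that uses the error function from the Python standard library to obtain the normal-curve proportion. The function name and the numeric inputs in the example are hypothetical, apart from the criterion value of 100 mentioned in the endnote.

```python
import math


def proportion_exceeding(predicted, criterion, see):
    """Expected proportion of subjects exceeding the criterion.

    predicted: estimated criterion score (Y') for a given strength level
    criterion: desired criterion value
    see:       standard error of estimate of the regression equation
    """
    z = (predicted - criterion) / see
    # Area of the standard normal curve below z, which equals the
    # proportion expected to score above the criterion.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical values: a predicted score of 110 against a criterion of 100
# with a standard error of estimate of 10 gives z = 1.
p = proportion_exceeding(predicted=110, criterion=100, see=10)
```

When the predicted score equals the criterion, z is 0 and half of the subjects at that strength level would be expected to exceed the criterion.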